White et al. [128] used a likelihood-ratio test to determine
whether tuple frequencies from a particular sequence more closely
resemble the frequency distribution of control data sets from the
taxon being sequenced or a distantly related outgroup. They computed
a test statistic $D$ for each sequence $s$ as the difference of
log-likelihood ratio dissimilarity measures, $d$, for two data sets, a
control set $C$ and an outgroup $O$, such that
$D(s) = d(s, C) - d(s, O)$. A negative value for $D$ indicates
that the sequence more closely resembles tuples from $C$; conversely,
a positive value indicates a likely contaminant related to $O$.
(Dissimilarity is conceptually similar to distance, but it is not a
true distance because it does not satisfy the mathematical
properties of a distance metric [125].) Unlike the
calculation of calibration curves, in which 300 nt subsequences are
randomly resampled, hexamer dissimilarity is measured over the whole
length of a test sequence when inferring a transcript's origin.
Originally, the investigators used the null hypothesis that no
difference exists for dissimilarity measures between the two data
sets, or that the test statistic $D = 0$ [128]. White et al.
[128] tested two alternative hypotheses: that $D < 0$, the sequence
$s$ being more like the control set $C$, or $D > 0$, like the
outgroup $O$.
Although we used the word-counting methods of White et al. [128], we
made slight modifications. We simplified one program (called hybridize) to
compute individual dissimilarity values, rather than paired
differences; a patch that details how to modify the C program is
available as supplementary material.
More importantly, we amended the null hypothesis and interpreted
calibration curves to test for statistically significant dissimilarity
differences. Although the likelihood-ratio test statistic $D$
indicates the magnitude of similarity to the control set $C$ or the
outgroup $O$, we do not know which values of $D$
are significant with known confidence. When testing hypotheses, one
can make two types of error: type I, or false positives, and type II,
or false negatives [47]. The false positive rate is denoted $\alpha$
and the false negative rate $\beta$. We determine $\alpha$
and $\beta$ from overlap in the calibration curves. Inferring error rates
from calibration curves is justified because the true origin of each
resampled subsequence is known and the error rate is determined by
resampling, as with bootstrap methods for inferring error rates [30].
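The resampling step can be sketched as follows. This is an
illustrative outline rather than our actual pipeline: the function
name resample_statistic, the number of resamples, and the fixed
random seed are assumptions; only the 300 nt subsequence length comes
from the text. The statistic argument stands for any function that
scores a subsequence, such as a difference of dissimilarities.

```python
import random


def resample_statistic(sequences, statistic, n_samples=1000,
                       length=300, seed=0):
    """Score randomly placed subsequences drawn from sequences of known origin.

    The returned values form the empirical distribution from which a
    calibration curve for one taxon can be drawn.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    eligible = [s for s in sequences if len(s) >= length]
    if not eligible:
        raise ValueError("no sequence is long enough to resample")
    values = []
    for _ in range(n_samples):
        seq = rng.choice(eligible)
        start = rng.randrange(len(seq) - length + 1)
        values.append(statistic(seq[start:start + length]))
    return values
```

Running the function once on sequences from the taxon being sequenced
and once on sequences from the outgroup gives the two distributions
whose overlap determines the error rates.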
We are interested in knowing from which of two organisms a sequence
originated, and are reasonably confident that it came from either one
or the other. Thus, we assume it came from one and test whether we
have evidence to refute this assumption. The null hypothesis here is
that sequence $s$ is from the control taxon $C$. Alternatively, it
might be from the outgroup $O$.
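Evaluating the calibration curve overlap at $D = 0$ quantifies the
associated error rates. The cumulative distribution function ($F$)
of the test statistic for the outgroup taxon $O$ specifies
$\beta = F_O(0)$ where $F_O$ intersects 0; the $F_C$ from the control
taxon $C$ specifies $\alpha$ as $1 - F_C(0)$. We can thus resolve the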
problem with known accuracy
:
.
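As a hedged illustration, the error rates can be read off the
empirical cumulative distribution functions of resampled test
statistics. The sketch assumes that the null hypothesis is origin in
the taxon being sequenced, that the statistic is the dissimilarity to
the control set minus the dissimilarity to the outgroup, and that the
decision threshold is zero; the function names are hypothetical.

```python
def empirical_cdf(values, x):
    """Fraction of observed values that are less than or equal to x."""
    if not values:
        raise ValueError("no values supplied")
    return sum(v <= x for v in values) / len(values)


def error_rates(control_values, outgroup_values, threshold=0.0):
    """Estimate (alpha, beta) from two calibration distributions.

    alpha: fraction of control-taxon resamples scoring above the
           threshold, i.e. wrongly flagged as contaminants.
    beta:  fraction of outgroup resamples scoring at or below the
           threshold, i.e. contaminants that go undetected.
    """
    alpha = 1.0 - empirical_cdf(control_values, threshold)
    beta = empirical_cdf(outgroup_values, threshold)
    return alpha, beta
```

For example, if one of four control resamples scores above zero and
one of four outgroup resamples scores below zero, both estimated
rates are 0.25.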