A lexical analysis compared the composition of training sequences that
represent taxa of potential origin, with differences in composition
measured as dissimilarity. As with distance metrics, low dissimilarity
implies similar composition.
Denoting
as dissimilarity and
as Dunning's likelihood-ratio
test statistic of hexamer dissimilarity [40,128], we
computed the difference of two dissimilarity values twice, as
and
, and compared the results.
The null hypothesis was that
, or that a transcript resembles
more closely the hexamer composition of
(fungi) than of
(plants or rhizobacteria). See
Section 2.3.5 and [128] for a fuller
explanation of this comparative approach.
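
The comparison described above can be illustrated with a minimal sketch. It assumes the dissimilarity is a likelihood-ratio (G) statistic computed over a 2 x K table of hexamer counts (sequence versus training set); the exact formulation used here is the one given in [40,128], which this sketch does not claim to reproduce. The function names and the argument order of the difference (`train_1` minus `train_2`) are hypothetical.

```python
import math
from collections import Counter


def hexamer_counts(seq, k=6):
    """Count overlapping k-mers, skipping windows with non-ACGT characters."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):
            counts[word] += 1
    return counts


def g_statistic(counts_a, counts_b):
    """Likelihood-ratio (G) statistic for a 2 x K table of word counts.

    Assumption: this homogeneity form stands in for the statistic
    defined in the cited references.
    """
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())
    if n_a == 0 or n_b == 0:
        return 0.0
    total = n_a + n_b
    g = 0.0
    for word in set(counts_a) | set(counts_b):
        a = counts_a.get(word, 0)
        b = counts_b.get(word, 0)
        col = a + b
        # Each cell adds observed * ln(observed / expected); empty cells add 0.
        if a:
            g += a * math.log(a * total / (n_a * col))
        if b:
            g += b * math.log(b * total / (n_b * col))
    return 2.0 * g


def dissimilarity_difference(seq, train_1, train_2):
    """Difference of two dissimilarities; negative means closer to train_1."""
    counts = hexamer_counts(seq)
    return g_statistic(counts, train_1) - g_statistic(counts, train_2)
```

Under this convention, a sequence whose hexamer composition matches the first training set more closely yields a negative difference, and swapping the two training sets flips the sign.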
Prior to performing comparisons, we used a resampling technique
[128] to evaluate the degree of overlap between training
sets. Calibration curves were obtained from 100 resampled replicates.
In each replicate, every training set was randomly halved; one half was
used to establish hexamer counts and the other half was used to compute
. The degree of overlap in the tails of calibration curves at
gives experiment-wide false positive and false negative rates.
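
The split-half calibration can be sketched as follows. This is an illustration under stated assumptions, not the procedure of [128]: `build_profile` and `score` are hypothetical callables supplied by the user (for example, hexamer counting and a dissimilarity difference), and `threshold` is deliberately left as a required argument because the decision point is not specified in this sketch. The sketch also assumes that sequences from the first set should score below the threshold.

```python
import random


def split_half(items, rng):
    """Shuffle a copy of items and return two halves."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def calibrate(set_1, set_2, build_profile, score, n_replicates=100, seed=0):
    """Collect held-out scores for both training sets over all replicates.

    build_profile(seqs) -> profile built from one half of a training set
    score(seq, profile_1, profile_2) -> float for one held-out sequence
    """
    rng = random.Random(seed)
    scores_1, scores_2 = [], []
    for _ in range(n_replicates):
        fit_1, held_1 = split_half(set_1, rng)
        fit_2, held_2 = split_half(set_2, rng)
        p1, p2 = build_profile(fit_1), build_profile(fit_2)
        scores_1.extend(score(s, p1, p2) for s in held_1)
        scores_2.extend(score(s, p1, p2) for s in held_2)
    return scores_1, scores_2


def tail_overlap(scores_1, scores_2, threshold):
    """Fraction of each score distribution on the wrong side of threshold.

    Assumption: set-1 sequences should fall below threshold and
    set-2 sequences at or above it.
    """
    wrong_1 = sum(s >= threshold for s in scores_1) / len(scores_1)
    wrong_2 = sum(s < threshold for s in scores_2) / len(scores_2)
    return wrong_1, wrong_2
```

The two fractions returned by `tail_overlap` play the role of the false positive and false negative rates; which is which depends on the set treated as the null.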
To facilitate comparison of many sequences of different lengths,
dissimilarity test values were transformed by dividing
by
, where
is the length of a sequence. In general,
test values were computed directly from full-length sequences. To
evaluate whether subsampling of a sequence yielded different
dissimilarity test results, we calculated
by resampling. When
resampling, 40 replicate subsequences of length 300 nt were randomly
chosen and compared with random subsets of training sets
and
.
If a sequence was 300 nt or shorter, the entire sequence was
compared with random subsets of
and
. Test results from
resampling are summarized as the mean and standard error of
.
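
The subsequence resampling can be sketched as below, assuming a hypothetical user-supplied `score(subsequence, rng)` callable that draws its own random subsets of the two training sets and returns the test value. The subset size and the test statistic itself are not part of this sketch. The standard error is taken as the sample standard deviation divided by the square root of the number of replicates.

```python
import random
import statistics


def resampled_score(seq, score, n_replicates=40, window=300, seed=0):
    """Return (mean, standard error) of score over random subsequences.

    Sequences no longer than `window` are scored whole in every
    replicate, so any variation then comes only from the score callable.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n_replicates):
        if len(seq) > window:
            start = rng.randrange(len(seq) - window + 1)
            sub = seq[start:start + window]
        else:
            sub = seq
        values.append(score(sub, rng))
    mean = statistics.fmean(values)
    std_err = statistics.stdev(values) / len(values) ** 0.5
    return mean, std_err
```

Passing the random generator into `score` keeps the whole procedure reproducible from a single seed.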