Next: Statistical Confidence Up: Data Analysis Previous: Training Sets   Contents

Hexamer Dissimilarity Comparisons

A lexical analysis compared the composition of training sequences that represent taxa of potential origin, measured as dissimilarity. As with distance metrics, low dissimilarity implies similar composition. Denoting $D$ as dissimilarity and $t$ as Dunning's likelihood-ratio test statistic of hexamer dissimilarity [40,128], we computed the difference between two dissimilarity values once for each of two alternative training sets, as $t_1=D(A)-D(B_1)$ and $t_2=D(A)-D(B_2)$, and compared the results. The null hypothesis was that $t \le 0$, that is, that a transcript resembles the hexamer composition of $A$ (fungi) at least as closely as that of $B$ (plants or rhizobacteria). See Section 2.3.5 and [128] for explanations of this comparative approach.
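The comparison above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used here: it assumes $D$ is the log-likelihood ratio ($G^2$) statistic computed over a two-row contingency table of hexamer counts (query versus training set), and the function names `hexamer_counts`, `g2_dissimilarity`, and `dissimilarity_difference` are hypothetical.

```python
from collections import Counter
from math import log

def hexamer_counts(seq, k=6):
    """Count overlapping k-mers, skipping windows with non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):
            counts[word] += 1
    return counts

def g2_dissimilarity(query, reference):
    """Assumed form of D: G^2 = 2 * sum(observed * ln(observed / expected))
    over a 2 x W table (rows: query, reference; columns: hexamers)."""
    n_q = sum(query.values())
    n_r = sum(reference.values())
    if n_q == 0 or n_r == 0:
        return 0.0
    n = n_q + n_r
    g2 = 0.0
    for word in set(query) | set(reference):
        column_total = query[word] + reference[word]
        for observed, row_total in ((query[word], n_q), (reference[word], n_r)):
            if observed > 0:
                expected = row_total * column_total / n
                g2 += observed * log(observed / expected)
    return 2.0 * g2

def dissimilarity_difference(seq, counts_a, counts_b):
    """t = D(A) - D(B); t <= 0 means the sequence is no less similar to A."""
    query = hexamer_counts(seq)
    return g2_dissimilarity(query, counts_a) - g2_dissimilarity(query, counts_b)
```

With two training sets $B_1$ and $B_2$, calling `dissimilarity_difference` once per set would give $t_1$ and $t_2$ under these assumptions.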

Prior to performing comparisons, we used a resampling technique [128] to evaluate the degree of overlap between training sets. Calibration curves were obtained from 100 resampled replicates in which each training set was randomly halved, and one half was used to establish hexamer counts, while the other half was used to compute $t$. The degree of overlap in the tails of calibration curves at $t = 0$ gives experiment-wide false positive and false negative rates.
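A minimal sketch of such a split-half calibration follows. It rests on several assumptions that the text does not confirm: training sets are lists of sequence strings, $D$ is a $G^2$ statistic over hexamer counts, a false positive is a held-out $A$ sequence with $t > 0$, and a false negative is a held-out $B$ sequence with $t \le 0$. The names `calibration`, `hexamer_counts`, and `g2` are hypothetical.

```python
import random
from collections import Counter
from math import log

def hexamer_counts(seqs, k=6):
    """Pool overlapping k-mer counts over an iterable of sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= set("ACGT"):
                counts[word] += 1
    return counts

def g2(query, reference):
    """Assumed dissimilarity: G^2 over a 2 x W table of hexamer counts."""
    n_q, n_r = sum(query.values()), sum(reference.values())
    if n_q == 0 or n_r == 0:
        return 0.0
    n = n_q + n_r
    total = 0.0
    for word in set(query) | set(reference):
        column = query[word] + reference[word]
        for observed, row in ((query[word], n_q), (reference[word], n_r)):
            if observed > 0:
                total += observed * log(observed * n / (row * column))
    return 2.0 * total

def calibration(set_a, set_b, replicates=100, seed=0):
    """Return t values for held-out A and B sequences plus error rates at t = 0."""
    rng = random.Random(seed)
    t_a, t_b = [], []
    for _ in range(replicates):
        halves = []
        for seqs in (set_a, set_b):
            shuffled = list(seqs)
            rng.shuffle(shuffled)
            mid = len(shuffled) // 2
            halves.append((shuffled[:mid], shuffled[mid:]))
        (count_a, held_a), (count_b, held_b) = halves
        ref_a, ref_b = hexamer_counts(count_a), hexamer_counts(count_b)
        for held, out in ((held_a, t_a), (held_b, t_b)):
            for seq in held:
                query = hexamer_counts([seq])
                out.append(g2(query, ref_a) - g2(query, ref_b))
    false_pos = sum(t > 0 for t in t_a) / len(t_a)   # A-like but t > 0
    false_neg = sum(t <= 0 for t in t_b) / len(t_b)  # B-like but t <= 0
    return t_a, t_b, false_pos, false_neg
```

The two lists of $t$ values correspond to the two calibration curves; plotting their distributions would show how much the tails overlap at $t = 0$.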

To facilitate comparison of many sequences of different lengths, dissimilarity test values were transformed by dividing $t$ by $\sqrt{L}$, where $L$ is the length of a sequence. In general, test values were computed directly from full-length sequences. To evaluate whether subsampling of a sequence yielded different dissimilarity test results, we also calculated $t$ by resampling. When resampling, 40 replicate subsequences of length 300 nt were randomly chosen and compared with random subsets of training sets $A$ and $B$. If a sequence was 300 nt or shorter, the entire sequence was compared with random subsets of $A$ and $B$. Test results from resampling are summarized as the mean and standard error of $t$.
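The length transformation and the resampling summary can be sketched as follows. This is a hedged illustration rather than the original procedure: the test statistic is passed in as a caller-supplied function `t_func(subsequence, subset_a, subset_b)`, training sets are assumed to be lists of sequences, and `subset_fraction` is a hypothetical parameter, since the size of the random training subsets is not stated in the text.

```python
import random
from math import sqrt
from statistics import mean, stdev

def length_normalized(t, seq_length):
    """Transform a test value by dividing it by sqrt(L)."""
    return t / sqrt(seq_length)

def resample_test_value(seq, set_a, set_b, t_func, replicates=40,
                        window=300, subset_fraction=0.5, seed=0):
    """Return (mean, standard error) of t over resampled subsequences.

    subset_fraction is an assumption; set_a and set_b must be lists.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(replicates):
        if len(seq) > window:
            # Draw a random window of fixed length from the sequence.
            start = rng.randrange(len(seq) - window + 1)
            sub = seq[start:start + window]
        else:
            # Sequences no longer than the window are used whole.
            sub = seq
        sub_a = rng.sample(set_a, max(1, int(len(set_a) * subset_fraction)))
        sub_b = rng.sample(set_b, max(1, int(len(set_b) * subset_fraction)))
        values.append(t_func(sub, sub_a, sub_b))
    std_err = stdev(values) / sqrt(len(values)) if len(values) > 1 else 0.0
    return mean(values), std_err
```

Passing the statistic as a function keeps the resampling logic independent of how $t$ itself is computed, which makes it easy to compare resampled results against the full-length value.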


Peter T. Hraber 2001-06-13