

Hexamer Dissimilarity

White et al. [128] used a likelihood-ratio test to determine whether tuple frequencies from a particular sequence more closely resemble the frequency distribution of control data sets from the taxon being sequenced or that of a distantly related outgroup. They computed a test statistic $t(A,B,x)$ for each sequence $x$ as the difference of log-likelihood ratio dissimilarity measures, $D(A,x) = -2\log \lambda(A,x)$, for two data sets, a control set $A$ and an outgroup $B$, such that $t(A,B,x) = D(A,x) - D(B,x)$. A negative value for $t$ indicates that the sequence more closely resembles tuples from $A$; conversely, a positive value indicates a likely contaminant related to $B$. (Dissimilarity is conceptually similar to distance, but it is not a true distance because it does not possess the mathematical properties of a distance metric [125].) Unlike the calculation of calibration curves, in which 300 nt subsequences are randomly resampled, hexamer dissimilarity is measured over the whole length of a test sequence when inferring a transcript's origin. Originally, the investigators used the null hypothesis that no difference exists between the dissimilarity measures for the two data sets, that is, $t(A,B,x)=0$ [128]. White et al. [128] tested two alternative hypotheses: $t<0$, meaning $x$ is more like $A$, or $t>0$, meaning $x$ is more like $B$.
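
As an illustration of how $t(A,B,x) = D(A,x) - D(B,x)$ might be computed, the following Python sketch counts overlapping hexamers and evaluates a dissimilarity of the form $-2\log\lambda$. The exact likelihood ratio used by White et al. [128] is not given here, so this sketch assumes a standard multinomial form, $D(A,x) = 2\sum_w n_w \log[(n_w/N)/p_A(w)]$, where $n_w$ are the hexamer counts in $x$, $N$ is their total, and $p_A(w)$ is the hexamer frequency in the reference set. The function names and the pseudocount used to avoid zero frequencies are likewise assumptions made for the example.

```python
import math
from collections import Counter

def hexamer_counts(seq, k=6):
    """Count overlapping k-tuples, skipping any that contain non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):
            counts[word] += 1
    return counts

def make_reference(seqs, k=6, pseudocount=1.0):
    """Return a function giving the smoothed frequency of a k-tuple in `seqs`.

    The pseudocount is an assumption of this sketch; it keeps every
    frequency positive so that the logarithm below is always defined.
    """
    counts = Counter()
    for s in seqs:
        counts.update(hexamer_counts(s, k))
    denom = sum(counts.values()) + pseudocount * 4 ** k
    return lambda word: (counts[word] + pseudocount) / denom

def dissimilarity(ref_freq, seq, k=6):
    """Assumed multinomial form of D = -2 log(lambda) for sequence vs. reference."""
    counts = hexamer_counts(seq, k)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("sequence contains no valid k-tuples")
    return 2.0 * sum(c * math.log((c / n) / ref_freq(w))
                     for w, c in counts.items())

def t_statistic(ref_a, ref_b, seq, k=6):
    """t(A,B,x) = D(A,x) - D(B,x); negative values favour A, positive favour B."""
    return dissimilarity(ref_a, seq, k) - dissimilarity(ref_b, seq, k)

# Toy data: an AT-rich "control" set A and a GC-rich "outgroup" B.
ref_a = make_reference(["AT" * 50])
ref_b = make_reference(["GC" * 30])
t_at = t_statistic(ref_a, ref_b, "ATATATATATATAT")
t_gc = t_statistic(ref_a, ref_b, "GCGCGCGCGCGCGC")
```

With these toy reference sets, the AT-rich test sequence yields a negative $t$ and the GC-rich one a positive $t$, matching the sign convention described above.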

Although we used the word-counting methods of White et al. [128], we made slight modifications. We simplified one program (called hybridize) to compute individual dissimilarity values, rather than paired differences; a patch that details how to modify the C program is available as supplementary material. More importantly, we amended the null hypothesis and interpreted calibration curves to test for statistically significant dissimilarity differences. Although the likelihood-ratio test statistic indicates the magnitude of similarity to $A$ or $B$, it does not by itself tell us which values of $t$ are significant at a known confidence level. When testing hypotheses, one can make two types of error: type I, or false positives, and type II, or false negatives [47]. The false positive rate is denoted $\alpha$ and the false negative rate $\beta$. We determine $\alpha$ and $\beta$ from overlap in the calibration curves. Inferring error rates from calibration curves is justified because the true origin of each resampled sequence is known and the error rate is determined via resampling, as with bootstrap methods [30].
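
The resampling step behind a calibration curve can be sketched as follows. This is only an illustration under stated assumptions: windows of 300 nt are drawn uniformly at random, with replacement, from sequences of known origin, and an arbitrary statistic (for example, a function computing $t$) is evaluated on each window. The function names, the uniform sampling scheme, and the fixed random seed are choices made for the example, not details taken from the original programs.

```python
import random

def resample_windows(seqs, n_draws, window=300, seed=0):
    """Draw `n_draws` random subsequences of length `window` from `seqs`.

    Sequences shorter than `window` are ignored. Sampling is uniform over
    eligible sequences and then over start positions (an assumption).
    """
    rng = random.Random(seed)
    eligible = [s for s in seqs if len(s) >= window]
    if not eligible:
        raise ValueError("no sequence is at least `window` nt long")
    windows = []
    for _ in range(n_draws):
        s = rng.choice(eligible)
        start = rng.randrange(len(s) - window + 1)
        windows.append(s[start:start + window])
    return windows

def calibration_values(seqs, statistic, n_draws, window=300, seed=0):
    """Sorted statistic values over resampled windows (an empirical curve)."""
    return sorted(statistic(w)
                  for w in resample_windows(seqs, n_draws, window, seed))

# Toy usage: count G's per window in a perfectly periodic sequence.
toy_seqs = ["ACGT" * 100, "AC" * 50]   # the second is too short to be sampled
toy_windows = resample_windows(toy_seqs, 20)
toy_curve = calibration_values(toy_seqs, lambda w: w.count("G"), 20)
```

Running the same procedure once on sequences known to come from $A$ and once on sequences known to come from $B$ would give the two sets of values whose overlap determines the error rates.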

We are interested in knowing from which of two organisms a sequence originated, and are reasonably confident that it came from one or the other. Thus, we assume it came from one and test whether we have evidence to refute this assumption. The null hypothesis here is that sequence $x$ is from $A$. Alternatively, it might be from $B$. Evaluating the calibration curve overlap at $t = 0$ quantifies the associated error rates. The cumulative distribution function ($cdf$) of taxon $B$ specifies $\beta$ as its value at zero, $\beta = cdf_B(0)$; the $cdf$ from $A$ specifies $\alpha$ as $1 - cdf_A(0)$. We can thus resolve the problem with known accuracy: for a sequence drawn from $A$, $P(t > 0) = \alpha$.
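
The relations $\alpha = 1 - cdf_A(0)$ and $\beta = cdf_B(0)$ can be evaluated directly from resampled $t$ values. The sketch below uses an empirical (step-function) $cdf$; the function names and the toy numbers are hypothetical and serve only to show the arithmetic.

```python
from bisect import bisect_right

def empirical_cdf(samples, value):
    """Fraction of samples less than or equal to `value`."""
    ordered = sorted(samples)
    return bisect_right(ordered, value) / len(ordered)

def error_rates(t_from_a, t_from_b, threshold=0.0):
    """Estimate (alpha, beta) from calibration values at the given threshold.

    alpha: fraction of sequences truly from A with t above the threshold
           (wrongly rejected as B-like).
    beta:  fraction of sequences truly from B with t at or below the
           threshold (wrongly accepted as A-like).
    """
    alpha = 1.0 - empirical_cdf(t_from_a, threshold)
    beta = empirical_cdf(t_from_b, threshold)
    return alpha, beta

# Toy calibration values (illustrative only, not real data).
t_from_a = [-5.0, -3.0, -1.0, 2.0]      # one of four exceeds 0
t_from_b = [-1.0, 1.0, 3.0, 4.0, 6.0]   # one of five is at or below 0
alpha, beta = error_rates(t_from_a, t_from_b)
```

For these toy values, one of the four $A$-derived statistics exceeds zero, so $\alpha = 0.25$, and one of the five $B$-derived statistics falls at or below zero, so $\beta = 0.2$.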


Peter T. Hraber 2001-06-13