
Comparison Sequences

Comparison curves present dissimilarity test results as normalized cumulative distributions, rather than as individual values. The point at which a comparison curve intersects a given threshold value of t indicates the proportion of sequences in a library likely to have originated from taxon A, as described in Section 2.4. Figure 4.3A summarizes calibration curves, confidence curves, and comparison curves for comparisons between fungi and plants. Unexpectedly, the axenic plant library (Mt Long) contains a lower proportion of putative plant transcripts than any of the three libraries prepared from G. intraradices tissues. Even at a conservative threshold, such as t = 400, a greater proportion of transcripts with plant-like hexamer compositions is present among the libraries from G. intraradices than in the library prepared from axenic M. truncatula root hairs. The extent of the disparity is quantified below.
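As a rough illustration of reading a normalized cumulative distribution at a threshold, the sketch below computes the fraction of sequences whose test value lies at or below a chosen t, along with its complement. The function name, the sample values, and the choice of which side of the threshold to report are assumptions made for illustration; the procedure actually used for Figure 4.3 is the one described in Section 2.4.

```python
from bisect import bisect_right


def cumulative_fraction(t_values, threshold):
    """Return the fraction of test values that are <= threshold.

    This is the height of the normalized cumulative distribution
    (the comparison curve) at the given threshold.
    """
    if not t_values:
        raise ValueError("t_values must not be empty")
    ordered = sorted(t_values)
    return bisect_right(ordered, threshold) / len(ordered)


# Hypothetical per-sequence test values for one library (not real data).
library_t = [-520.0, -130.0, -40.0, 15.0, 90.0, 310.0, 450.0, 800.0]

at_or_below = cumulative_fraction(library_t, 400)
above = 1.0 - at_or_below
print(f"fraction with t <= 400: {at_or_below:.2f}")
print(f"fraction with t >  400: {above:.2f}")
```

Sorting once and using a binary search keeps repeated threshold lookups cheap if the same library is evaluated at many values of t.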

Comparison results between fungi and rhizobacteria are shown in Figure 4.3B. Only one library, the Lammers library from germinating G. intraradices spores (black line), contains a proportion of rhizobacteria-like sequences that approaches significance, though none of these sequences is significant at P < 0.05.

In contrast to these library-wide properties, the characteristics of individual sequences may also be of interest. One approach is to identify individual sequences whose hexamer compositions most closely resemble those of the fungal, plant, or rhizobacterial training sets (Figure 4.4). Plotting both test values for each sequence, $t_1$ from the fungi-plant comparison and $t_2$ from the fungi-rhizobacteria comparison, also allows results from several libraries to be visualized at once. Transcripts that resemble the fungal training sets should appear in the lower-left quadrant ($t_1<0$; $t_2<0$), because t < 0 in comparisons with both plants and rhizobacteria. By the same reasoning, plant-like sequences should appear in the lower-right quadrant ($t_1>0$; $t_2<0$), and sequences resembling rhizobacteria should appear in the upper-left quadrant ($t_1<0$; $t_2>0$).
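The quadrant rules above can be sketched as a small classifier. The function name and labels are hypothetical. The text does not assign a meaning to the upper-right quadrant or to values of exactly zero, so this sketch reports those cases as "unassigned"; that choice is an assumption.

```python
def classify_quadrant(t1, t2):
    """Label a sequence by the signs of its paired test values.

    t1: fungi-vs-plant test value, D(A) - D(B_1)
    t2: fungi-vs-rhizobacteria test value, D(A) - D(B_2)
    """
    if t1 < 0 and t2 < 0:
        return "fungal-like"          # lower-left quadrant
    if t1 > 0 and t2 < 0:
        return "plant-like"           # lower-right quadrant
    if t1 < 0 and t2 > 0:
        return "rhizobacteria-like"   # upper-left quadrant
    # Upper-right quadrant or a value of exactly zero: not covered
    # by the rules in the text (assumption: leave unlabeled).
    return "unassigned"


# Hypothetical paired test values (not real data).
for t1, t2 in [(-210.0, -95.0), (340.0, -60.0), (-45.0, 120.0), (80.0, 30.0)]:
    print(t1, t2, classify_quadrant(t1, t2))
```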

Figure 4.4: Paired dissimilarity test values summarize results from two pairwise comparisons: between fungi and plants, $D(A)-D(B_1)$, and between fungi and rhizobacteria, $D(A)-D(B_2)$. Each point corresponds to a single transcript from one of four libraries (open circles) or to a validation gene sequence (solid circles) from either mycorrhizal fungi (Glomus) or plants (Medicago). Libraries are described in Table 4.1; individual genes are described in Table 4.3 as validation sequences.

In Figure 4.4, sequences from Medicago truncatula (solid green circles) generally lie in the lower-right quadrant. Shorter sequences fall near the origin, while longer sequences lie farther from it, because the magnitude of t generally increases with sequence length. Sequences from Glomus spp. (solid magenta circles) generally lie in the lower-left quadrant, with the exceptions noted for Table 4.3 and one sequence whose hexamer composition more closely resembles rhizobacteria than fungi. This sequence is a long (L > 1500 nt) fungal homeobox gene (accession AF110198) with a high GC content (56.1%).

Consistent with Figure 4.3, the plant library has a greater proportion of putative fungal sequences than the fungal libraries do. This is apparent from the large number of transcripts from the Long root-hair enriched library (open green circles) that appear in the lower-left quadrant, whereas transcripts from the fungal libraries lie mostly in the plant (lower-right) quadrant.

In light of the confidence calculations described above, none of the plant or fungal transcripts that resemble rhizobacteria more closely than fungi should be considered significantly non-fungal (P > 0.05). However, in the case of fungal-plant comparisons, a non-trivial proportion of both plant and fungal transcripts appear to resemble plants strongly enough to reject the null hypothesis (P < 0.05). For a critical test value of t=312, 11% of 899 transcripts in the Long plant library, 10% of 363 transcripts in the Harrison library, 25% of 182 transcripts in the Lammers library, and 10% of 165 transcripts in the Sawaki library have hexamer compositions that significantly resemble plants, when compared with fungi.

To control for the tendency of t to increase with sequence length, one can rescale t as $t/\sqrt{L}$ and plot the transformed data (Figure 4.5). In addition to the transformed data from the previous figure, values obtained from hypothetical repeated-hexamer sequences of increasing length, from 64 to 1024 nt, are also shown. These illustrate the influence of individual hexamers on t. Sequences consisting only of the hexamers AAAAAA, TTTTTT, or GAGAGA strongly resemble plant sequences, lying in the lower-right quadrant, and sequences consisting only of GCGCGC strongly resemble rhizobacteria. The degree of resemblance increases with sequence length, but none of the repeat sequences changed quadrants.
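A minimal sketch of the length rescaling and of building repeated-hexamer test sequences follows. The function names are hypothetical. Because the lengths used (powers of two from 64 to 1024) are not multiples of six, the sketch truncates the final repeat to reach the exact length; how the original sequences handled this is not stated, so the truncation is an assumption.

```python
import math


def rescale(t, length):
    """Rescale a test value by the square root of sequence length."""
    if length <= 0:
        raise ValueError("length must be positive")
    return t / math.sqrt(length)


def repeat_hexamer(hexamer, length):
    """Build a sequence of the given length by repeating one hexamer.

    Assumption: the last copy is truncated when length is not a
    multiple of six.
    """
    if len(hexamer) != 6:
        raise ValueError("expected a 6-nt hexamer")
    copies = -(-length // 6)  # ceiling division
    return (hexamer * copies)[:length]


# Sequence lengths from 2^6 to 2^10 nt, as in Figure 4.5.
lengths = [2 ** k for k in range(6, 11)]

for hexamer in ("AAAAAA", "TTTTTT", "GAGAGA", "GCGCGC"):
    for n in lengths:
        seq = repeat_hexamer(hexamer, n)
        print(hexamer, n, len(seq))

# Example: a hypothetical test value of 312 for a 1024-nt sequence.
print(rescale(312.0, 1024))
```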

Figure 4.5: Length-scaled paired dissimilarity test results. As in Figure 4.4, paired dissimilarity test values summarize results from two pairwise comparisons: fungi and plants, $D(A)-D(B_1)$, and fungi and rhizobacteria, $D(A)-D(B_2)$. Here, dissimilarity values are rescaled by dividing by the square root of the sequence length. In addition to transcripts from four libraries and the 26 validation sequences, data from hypothetical, simple hexamer repeat sequences of length 64 ($L=2^6$), 128, 256, 512, and 1024 ($L=2^{10}$) nt are shown as examples, connected by dotted lines.


Peter T. Hraber 2001-06-13