Next: Calibration and Confidence Curves Up: Hexamer Dissimilarity Comparisons Previous: Hexamer Dissimilarity Comparisons   Contents

Validation Sequences

Results of analyzing 14 fungal and 12 plant genes are tabulated in Table 4.3 and illustrated in Figure 4.1. Hexamer dissimilarity comparisons correctly identify the origin of 10 of 14 fungal sequences and 10 of 12 plant sequences. The four fungal sequences whose origins are incorrectly identified have lower GC content (< 40%) than the other fungal sequences. Similarly, the two plant sequences whose hexamer composition resembles that of fungi have greater GC content (> 47%) than the other plant sequences. Other factors, namely the length of the sequence analyzed and whether or not the molecule isolated was an mRNA, were not consistently associated with misidentification.
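The decision rule and the GC comparison above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and the sign convention (negative $t$ read as fungus-like, positive $t$ as plant-like) is inferred from the values in Table 4.3 rather than stated as a definition. The plant rows are copied from Table 4.3.

```python
def gc_percent(seq):
    """Percent of G and C bases in a nucleotide sequence (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def infer_origin(t):
    """Assumed sign rule: t = D(A) - D(B_1) < 0 is read as fungus-like.

    Treating t == 0 as plant-like is an arbitrary tie-break for this sketch.
    """
    return "fungus" if t < 0 else "plant"


# (%GC, direct t) for the 12 Medicago truncatula rows of Table 4.3.
plant_rows = [
    (40.3, 766.6), (39.6, 792.0), (34.0, 648.3), (44.1, 255.4),
    (47.0, 85.5), (42.6, 64.1), (43.6, 11.9), (45.7, 43.8),
    (45.6, 16.3), (40.1, 115.0), (50.0, -27.2), (47.4, -17.2),
]

# GC content of plant sequences whose direct t has the fungus-like sign.
misidentified_gc = [gc for gc, t in plant_rows if infer_origin(t) != "plant"]
print(len(plant_rows) - len(misidentified_gc), "of", len(plant_rows), "plant sequences correct")
print("GC of misidentified plant sequences:", misidentified_gc)
```

Under these assumptions, the two misidentified plant rows are the ones with the highest GC content in the plant set.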

Similar experiments performed with more inclusive training sets, which did incorporate mRNA sequences and longer poly-A and poly-T regions, yielded inferences qualitatively similar to those described here, but with more overlap between calibration curves, and thus larger error rates (not shown).

Values of t resampled from randomly chosen subsequences vary most for long sequences. In most cases, resampling does not change the inference drawn about which taxon a sequence resembles more closely, because the sign of t measured directly agrees with the sign of the average value of t when resampled (Table 4.3). This indicates that most sequences are homogeneous in their hexamer composition relative to the plant and fungal training sets, and that potential chimeras of transcripts from two different species, or concatemers of vector and insert sequences, either are rare or could not be detected from their hexamer composition.
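A minimal sketch of the resampling step follows, assuming 40 replicates and subsequences of up to 300 nt as described in the caption of Table 4.3. The names `resample_t`, `same_inference`, and `t_stat` are hypothetical. `t_stat` stands in for whatever function computes $t=D(A)-D(B_1)$ for a sequence; the hexamer dissimilarity itself is not reproduced here.

```python
import random
from statistics import mean


def resample_t(seq, t_stat, n_reps=40, max_len=300, rng=None):
    """Evaluate t_stat on n_reps randomly placed subsequences of seq.

    Each subsequence has length min(max_len, len(seq)), so sequences
    shorter than max_len are simply re-evaluated whole.
    """
    rng = rng or random.Random()
    window = min(max_len, len(seq))
    values = []
    for _ in range(n_reps):
        start = rng.randint(0, len(seq) - window)  # inclusive bounds
        values.append(t_stat(seq[start:start + window]))
    return values


def same_inference(t_direct, resampled):
    """True when direct t and the mean resampled t have the same sign."""
    return (t_direct < 0) == (mean(resampled) < 0)


# Toy stand-in for t (NOT the hexamer dissimilarity): AT count minus GC count.
def toy_t(sub):
    return sum(1 if base in "AT" else -1 for base in sub)


demo_seq = "ATGCGC" * 200  # 1200 nt placeholder sequence
replicates = resample_t(demo_seq, toy_t, rng=random.Random(0))
print(len(replicates), same_inference(toy_t(demo_seq), replicates))
```

Passing the statistic in as a callable keeps the resampling logic independent of how dissimilarity is actually computed.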

One conspicuous exception to this observation is the phosphate transporter cloned from G. versiforme (accession U38650, Table 4.3). This sequence strongly resembles the plant training set in hexamer composition, although random sampling of 300 nt subsequences yields fragments that are less clearly plant-like. In both hexamer composition and variability when resampled, it closely resembles the plant phosphate transporter MtPT1 (accession AF000354). Whether other fungal phosphate transporters resemble plant phosphate transporters in the same manner should be a subject of further inquiry.

Three other exceptions warrant comment: an mRNA sequence coding for the gene MYC1 in G. intraradices (accession AF110196), and two sequences from M. truncatula, an mRNA encoding a chitinase (accession Y10373) and a genomic DNA sequence encoding a putative chitinase (accession AF167327). Though t measured directly has the correct sign for all three (as does the resampled mean, except for AF167327), the absolute value of the standard error obtained from resampling is greater than unity ($s>\vert\bar{t}\vert$; Table 4.3), which indicates great variability of t: some subsequences are more plant-like in hexamer composition, while others are more fungus-like. However, the majority of plant chitinases analyzed do not exhibit this property.
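The variability criterion can be sketched as follows. This assumes that the tabulated $s_{\bar{t}}$ is the standard error of the resampled mean divided by that mean, which is what the condition $s>\vert\bar{t}\vert$ and the signed values in Table 4.3 suggest; that reading is an assumption, as are the function names.

```python
from math import sqrt
from statistics import mean, stdev


def relative_standard_error(values):
    """Standard error of the mean divided by the mean (assumed definition).

    The result carries the sign of the mean. A mean of exactly zero
    raises ZeroDivisionError.
    """
    m = mean(values)
    se = stdev(values) / sqrt(len(values))
    return se / m


def is_heterogeneous(values):
    """Flag a sequence whose resampled t varies more than its mean magnitude."""
    return abs(relative_standard_error(values)) > 1


# Made-up replicate values, for illustration only.
mixed = [10, -8, 12, -9, 11, -10]      # subsequences disagree in sign
uniform = [-100, -110, -90, -105]      # subsequences agree
print(is_heterogeneous(mixed), is_heterogeneous(uniform))
```

A sequence whose subsequences switch between plant-like and fungus-like values has a mean near zero relative to its spread, so it is flagged; a sequence with consistent values is not.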


Table 4.3: Validation Sequences: Dissimilarity comparison results for fungal (Glomus) and plant (Medicago) sequences. For each sequence, the length and percent GC content are shown, as are the directly measured dissimilarity difference $t=D(A)-D(B_1)$, and the mean ($\bar{t}$) and standard error ($s_{\bar{t}}$) from 40 resampled replicates, in which a subsequence of up to 300 nt was randomly chosen and compared with training sets from fungi and plants (Table 4.2), the sequence accession number, an indication of whether or not the molecule isolated was mRNA (if not, then it was DNA), and the name of the associated gene (or protein). Incorrect inferences, as determined by the sign of $t$, are underlined; cases where $\vert s_{\bar{t}}\vert>1$ are in bold.
$L$ (nt)  %GC   $t$ (direct)  $\bar{t}$  $s_{\bar{t}}$  ACCESSION  mRNA?  GENE NAME

Glomus intraradices
1532      56.1  -1146.4       -250.1     -0.15          AF110198   Y      homeobox protein HB1
858       38.8  382.0         132.5      0.28           AF110197   Y      MYC2
1453      46.3  -67.1         -28.6      -1.49          AF110196   Y      MYC1
610       34.3  357.9         206.7      0.17           L77908     N      chitin synthase
617       54.8  -544.6        -365.6     -0.11          AF260996   N      chitin synthase, isolate GiBCHS1
614       54.7  -530.1        -362.8     -0.11          AF260993   N      chitin synthase, isolate GiCHS3
617       50.7  -396.1        -267.4     -0.10          AF260986   N      chitin synthase, isolate GiCHS2
617       50.6  -386.6        -261.8     -0.10          AF260985   N      chitin synthase, isolate GiBCHS2
617       50.6  -386.6        -267.7     -0.09          AF260983   N      chitin synthase, isolate GiVCHS2
617       50.6  -386.6        -278.0     -0.08          AF260982   N      chitin synthase, isolate GiWCHS2

Glomus versiforme
4116      52.8  -2452.2       -194.8     -0.35          AJ009630   N      chitin synthase, clone Gvchs3
481       49.7  -282.0        -222.8     -0.22          AJ009629   N      chitin synthase, clone Gvchs2
638       49.1  -132.8        -87.2      -0.33          AJ009628   N      chitin synthase, clone Gvchs1
1833      37.2  783.7         89.3       0.68           U38650     Y      phosphate transporter

Medicago truncatula
1867      40.3  766.6         104.9      0.30           AF000355   Y      phosphate transporter MtPT2
1920      39.6  792.0         88.3       0.61           AF000354   Y      phosphate transporter MtPT1
954       34.0  648.3         168.6      0.29           AF055921   N      Mt4
1305      44.1  255.4         20.7       2.98           Y10373     Y      chitinase
181       47.0  85.5          76.5       0.14           AF167329   N      chitinase, clone T130008g
265       42.6  64.1          42.4       0.23           AF167328   N      chitinase, clone T130007g
188       43.6  11.9          -6.2       -1.45          AF167327   N      chitinase, clone T130006g
188       45.7  43.8          26.2       0.27           AF167326   N      chitinase, clone T130005g
191       45.6  16.3          28.1       0.27           AF167325   N      chitinase, clone T130004g
197       40.1  115.0         124.8      0.08           AF167324   N      chitinase, clone T130003g
260       50.0  -27.2         -38.8      -0.28          AF167323   N      chitinase, clone T130002g
245       47.4  -17.2         -21.4      -0.50          AF167322   N      chitinase, clone T130001g

Figure 4.1: Paired dissimilarity test values summarize results from two pairwise comparisons, between fungi and plants, $D(A)-D(B_1)$, and between fungi and rhizobacteria, $D(A)-D(B_2)$. Each point corresponds to a single sequence from either mycorrhizal fungi (Glomus) or plant (Medicago). Sequence accession numbers are provided in Table 4.3.


Peter T. Hraber 2001-06-13