

Results

Validation sequence accession numbers, gene names, and comparison results appear in Table 2.3. Incorrect inferences are underlined. When tested against sequences of known origin, the word-counting method was generally reliable, giving an incorrect inference for 3 of 50 sequences. This corresponds to a failure rate of 6%; all three failures are false negatives under the null hypothesis that a transcript originates from the plant host. Performance of the method was not influenced by whether the isolated source of a sequence was an mRNA or a DNA molecule, as indicated by the column labelled ``mRNA?''.


Table 2.3: Dissimilarity (D) comparison results from fifty validation sequences.
ACCESSION GENE NAME mRNA? L (nt) $D(A)$ PLANTS $D(B_1)$ STRAMEN. $D(B_2)$ BACTERIA
Glomus versiforme
AJ009628 chitin synthase Gvchs1 N 638 2535.2 2468.6 2718.4
AJ009629 chitin synthase Gvchs2 N 481 2203.2 2050.0 2286.0
AJ009630 chitin synthase Gvchs3 N 4116 7205.9 5235.8 5985.8
U38650 phosphate transporter Y 1833 3937.9 5702.3 6514.3
Glycine max
J01297 actin SAc3 N 1620 3322.0 4554.6 5329.7
K00821 lectin Le1 N 2152 4124.6 6558.3 7928.3
M64267 iron superoxide dismutase Y 1056 2773.6 3761.2 4269.2
Medicago truncatula
AF000354 phosphate transporter MtPT1 Y 1920 3800.3 5630.7 6654.2
AF000355 phosphate transporter MtPT2 Y 1867 3673.9 5390.1 6424.0
AF055921 Mt4 genomic sequence N 954 2631.9 4004.4 4539.1
AF106929 cell wall protein AM1 Y 885 3433.6 4200.0 4774.3
AF106930 translation initiation protein AM3-1 Y 3154 4557.6 5982.7 7212.8
AF106931 translation initiation protein AM3-2 Y 1384 3371.1 4130.0 4644.4
AJ132891 ha1 gene, exons 1-22 N 3620 4383.2 8683.6 10730.7
AJ388847 MtNo213 superoxide dismutase Y 530 2110.2 2219.8 2367.0
AJ388865 MtNo233 triosephosphate isomerase Y 563 2171.6 2405.6 2618.6
U16727 peroxidase precursor rip1 N 2603 4246.1 8210.0 9901.9
U38651 sugar transporter Y 1728 3619.6 5128.4 5976.5
X57732 leghemoglobin Mtlb1 N 1073 3021.3 5029.1 5845.9
X57733 leghemoglobin Mtlb2 N 592 2045.9 3156.0 3568.2
X60386 lectin lec1 N 1363 3228.8 4935.4 5605.6
X60387 lectin lec2 N 1192 3142.8 4472.6 4985.9
X82216 lec3 N 1155 2928.4 4283.3 4930.8
X68032 ENOD12 N 772 2780.4 3679.7 4096.5
X99466 ENOD16 N 1142 3124.6 4535.5 5156.2
X99467 ENOD20 N 1405 4003.6 5294.7 5966.7
Y10267 glutamine synthetase Y 1413 3116.1 4506.5 5292.1
Y10373 chitinase Y 1305 3369.5 4090.4 4703.4
Phytophthora infestans
AF004951 surface glycoprotein elicitor, inf2A Y 648 3428.4 2421.9 2589.1
AF004952 surface glycoprotein elicitor, inf2B Y 701 3611.7 2514.5 2698.6
L23938 ipiO2 N 1556 4125.2 4339.5 4855.5
L23939 ipiO1 N 1826 4360.5 4580.9 5259.0
L24206 ipiB1 N 1726 6086.7 4584.3 5159.3
M59715 actin actA N 1736 5137.1 3637.2 4420.0
M59716 actin actB N 1405 4425.3 3569.5 4141.6
M83535 calmodulin calA N 1358 4063.0 3724.9 4138.1
X64537 tigA N 2448 6221.0 4193.8 5181.9
P. capsici
U42304 chitin synthase chs N 449 2238.8 1882.7 1997.5
P. parasitica
X97205 cellulose binding-elicitor lectin Y 918 3819.1 2876.0 3208.4
Sinorhizobium meliloti
AF040724 nodD N 1776 5317.9 4197.8 4179.4
AF110770 superoxide dismutase sodA N 1196 4898.2 3343.2 2916.0
M61753 exoD N 858 4071.8 2847.2 2372.9
M68858 nodulation protein nodP & nodQ N 3476 9992.3 5288.0 3954.7
M96261 phosphate regulators, phoU & phoB N 1178 5332.7 3359.3 2866.4
U90221 syrA N 1102 4176.2 3375.8 3220.8
X01649 nodA, nodB, & nodC N 3373 7684.1 4819.5 4646.1
X03065 regulatory nitrogen fixation fixD N 2111 5723.0 4249.8 4228.3
X17523 glutamine synthetase II N 990 4303.5 2959.9 2720.4
Y08500 putA N 3804 13623.3 6376.5 4212.0
Agrobacterium tumefaciens
U91632 sugar transporter gguA & membrane-spanning permeases gguB & gguC N 4185 11132.6 5959.4 4551.4
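
The comparison behind each row of Table 2.3 can be illustrated with a short sketch. It assumes, following the caption of Figure 2.2, that a sequence is called plant-like when t = D(A) - D(B) is negative; it uses only the plant and Stramenopile columns for a few rows copied from the table. The function names and the selection of rows are illustrative assumptions, not the code used to produce the table.

```python
# Rows copied from Table 2.3:
# (accession, source organism, D(A) plants, D(B_1) Stramenopiles)
ROWS = [
    ("J01297", "Glycine max", 3322.0, 4554.6),
    ("AF000354", "Medicago truncatula", 3800.3, 5630.7),
    ("AF004951", "Phytophthora infestans", 3428.4, 2421.9),
    ("L23938", "Phytophthora infestans", 4125.2, 4339.5),
]


def dissimilarity_difference(d_a, d_b):
    """Return t = D(A) - D(B); t < 0 means the sequence is closer to set A."""
    return d_a - d_b


def call_origin(d_a, d_b):
    """Assumed decision rule: plant-like when t < 0, else Stramenopile-like."""
    if dissimilarity_difference(d_a, d_b) < 0:
        return "plant-like"
    return "Stramenopile-like"


for accession, organism, d_a, d_b in ROWS:
    t = dissimilarity_difference(d_a, d_b)
    print(f"{accession:9s} {organism:24s} t = {t:8.1f}  {call_origin(d_a, d_b)}")
```

Under this assumed rule, the P. infestans sequence L23938 is called plant-like, which disagrees with its known origin.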

Table 2.4 summarizes the percent GC content of the training sequences and their ten most abundant hexamers. Plants and fungi (pooled Zygomycetes and Chytridiomycetes) have mean GC content below 50%, with means differing by less than 3 percentage points, and large standard deviations (5-8%). Sequences from Stramenopiles (pooled with P. infestans ESTs) and from rhizobacteria have mean GC content greater than 50%. Thus, fungi and plants are least distinguishable based on GC content, whereas Stramenopiles and rhizobacteria are more readily distinguished from plants. Differences in GC content between taxa are also reflected in the common hexamers. The abundance of poly-A and poly-T hexamers among plants and fungi is clearly evident, but is less apparent among Stramenopiles and rhizobacteria (Table 2.4).


Table 2.4: Percent GC composition and ten most common hexamers in training sequences. For each library, mean ($\bar{x}$) and standard deviation (s) of percent GC content, and ten most abundant hexamers (by rank) in each training set, with percentage of all hexamers represented by that hexamer, are shown.
PLANTS STRAMENOPILES FUNGI RHIZOBIA
$\bar{x}$ s $\bar{x}$ s $\bar{x}$ s $\bar{x}$ s
% GC 40.7 5.18 55.8 4.24 43.6 8.47 60.0 3.57
RANK
1 0.250 AAAAAA 0.126 AAGAAG 0.257 AAAAAA 0.159 GCCGGC
2 0.182 TTTTTT 0.113 CAAGGA 0.184 TTTTTT 0.159 CGCCGC
3 0.118 TTTGTT 0.111 CAAGAA 0.114 CAAGAA 0.157 GGCGGC
4 0.117 ATTTTT 0.108 AAAAAA 0.110 GGTGGT 0.157 GCGGCG
5 0.117 TATTTT 0.105 GCTGCT 0.109 AATAAA 0.156 CGGCGC
6 0.111 TTGTTT 0.098 CTGCTG 0.108 AAGAAG 0.150 CGGCGA
7 0.110 ATATAT 0.096 AAGGAG 0.106 AAAGAA 0.147 TCGGCG
8 0.109 TTATTT 0.096 GCCAAG 0.104 AAAAAT 0.147 TCGCCG
9 0.109 GAAGAA 0.094 TGCTGC 0.104 AAATAA 0.141 GCGCCG
10 0.108 AAAAAT 0.092 GAGGAG 0.100 TGGTGG 0.141 CCGGCG
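
The quantities in Table 2.4 can be computed along the following lines. This is a minimal sketch, not the code used for the table: it assumes overlapping hexamer windows on a single strand and skips windows containing ambiguous bases, and the function names are hypothetical.

```python
from collections import Counter


def percent_gc(seq):
    """Percent of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def hexamer_percentages(seq, k=6):
    """Percentage of all overlapping k-mers accounted for by each k-mer.

    Windows containing anything other than A, C, G, or T are skipped
    (an assumption of this sketch).
    """
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: 100.0 * n / total for word, n in counts.items()}


def top_hexamers(seq, n=10):
    """The n most abundant hexamers as (hexamer, percent), most common first."""
    pct = hexamer_percentages(seq)
    return sorted(pct.items(), key=lambda item: (-item[1], item[0]))[:n]
```

For a training set, the sequences would be pooled before ranking hexamers, whereas the mean and standard deviation of percent GC would be taken over individual sequences.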

Distributions of GC content are approximately normal in two of the three cases studied, namely the axenic P. sojae cultures (Figure 2.1). For sequences from infected plant cultures, a bimodal distribution is apparent. Roughly 25% of the 927 sequences from infected G. max contain less than 50% GC; most of these are likely plant transcripts [93]. This proportion is considerably greater than in axenic P. sojae cultures, in which fewer than 5% of sequences from the mycelium and zoospore libraries contain less than 50% GC.

Figure 2.1: Distribution of GC base composition in two axenic Phytophthora sojae mycelia (blue) and zoospore (cyan) cDNA libraries and infected Glycine max (green). (A) Probability densities for histogram bin sizes of 0.02 (2%) in base composition. (B) Cumulative probability distributions.
\begin{figure}\begin{center}
\leavevmode
\epsfig{file=origin/figures/Fig1.eps}\end{center}\end{figure}

Several properties of cumulative distribution functions warrant comment, because they help explain similar plots from word dissimilarity comparisons (Figures 2.1B and 2.2A). The median of a distribution occurs where the function reaches a cumulative probability of 0.5. Medians from all three P. sojae libraries are similar, varying by less than 4% GC (Figure 2.1B). Other properties of the distributions are also readily apparent; the variance is inversely related to the slope of the function at the median. A useful property of a cumulative distribution function is that its value on the y axis at any point x gives the integrated area (cumulative probability) under the density curve up to x. We use this property to test for significant dissimilarity differences (Figure 2.2A). For the comparison shown in Figure 2.2, $\alpha=0.088$ and $\beta=0.032$.
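
As an illustration of how these error rates can be read from cumulative distributions, the sketch below uses an empirical cdf and the relations $\alpha = 1 - cdf_A(0)$ and $\beta = cdf_B(0)$ given in the caption of Figure 2.2. The use of an empirical (rather than fitted) cdf is an assumption of this sketch, and the t values are invented for illustration only.

```python
from bisect import bisect_right
from statistics import median


def ecdf(values):
    """Return a function F where F(x) is the fraction of values <= x."""
    ordered = sorted(values)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def error_rates(t_a, t_b):
    """alpha = 1 - cdf_A(0) and beta = cdf_B(0), for t = D(A) - D(B)."""
    alpha = 1.0 - ecdf(t_a)(0.0)
    beta = ecdf(t_b)(0.0)
    return alpha, beta


# Hypothetical t values for two calibration sets (not data from this study).
t_plants = [-900.0, -650.0, -400.0, -120.0, 80.0]
t_stramenopiles = [-60.0, 150.0, 420.0, 700.0, 1100.0]

alpha, beta = error_rates(t_plants, t_stramenopiles)
print("alpha:", alpha, "beta:", beta)
print("medians:", median(t_plants), median(t_stramenopiles))
```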

Figure 2.2: Cumulative distribution functions (cdfs) of hexamer dissimilarity. (A) Calculation of statistical parameters from cdfs A and B. Overlap in the upper tail of A with B and the lower tail of B with A are likely regions for error. We find the false positive rate $\alpha$ where $1 - cdf_A$ intersects 0 [$cdf_A(0) = 1 - \alpha$], and the false negative rate $\beta$ where $cdf_B$ crosses 0. Also shown are the medians $\mu$ for each distribution, where $cdf(\mu) = 0.5$. (B) Calibration curves for plant (Glycine and Medicago, solid black line) and Stramenopile plus P. infestans EST (dashed black line) training sequences. Superimposed distributions of test results show dissimilarity differences for infected G. max (green) and axenic P. sojae mycelia and zoospore sequences (blue and cyan, respectively).
\begin{figure}\begin{center}
\leavevmode
\epsfig{file=origin/figures/Fig2.eps}\end{center}\end{figure}

Calibration curves from hexamer dissimilarity tests, shown in Figure 2.2B as a solid black line for plant and a dashed black line for Stramenopile training sequences, are approximately normal. The medians differ considerably, with only about 10% overlap in the tails of the two distributions about the neutral value t = 0. Superimposed are comparison curves from P. sojae test sets (Figure 2.2B), which parallel the GC composition curves in Figure 2.1B but show slightly less variance. Axenic sequences are clearly more like Oomycetes than plants in hexamer composition, with all but a small percentage having positive t values. Plant-like sequences make up about 23% of the mixed library, similar to the proportion detected by GC composition. As expected, the two methods agree, having positively correlated values for GC and t ($r^2=0.852$, $P<2 \times 10^{-16}$, $\nu=2641$).
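
For readers who wish to repeat this kind of comparison, the sketch below computes the squared Pearson correlation between per-sequence GC content and t. It assumes that $r^2$ is the squared Pearson coefficient and that $\nu$ denotes the residual degrees of freedom, n - 2; neither is stated explicitly here. The data values are invented for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need two samples of equal length, at least 3 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / sqrt(sxx * syy)


def r_squared_and_df(xs, ys):
    """Return (r^2, n - 2); n - 2 is the assumed meaning of nu."""
    r = pearson_r(xs, ys)
    return r * r, len(xs) - 2


# Hypothetical per-sequence values: percent GC and t = D(A) - D(B).
gc = [38.0, 42.0, 55.0, 58.0, 61.0]
t = [-800.0, -500.0, 300.0, 650.0, 900.0]
print(r_squared_and_df(gc, t))
```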

Looking in more detail at the paired dissimilarity values (Figure 2.3), we can see which individual sequences are more or less like plant and pathogen. The magnitudes of dissimilarity are also apparent, with longer sequences having larger dissimilarity values. BLASTX similarity searches against the protein sequences in nr (from ftp.ncbi.nlm.nih.gov/blast/db) revealed that none of the 12 plant-like mycelial transcripts significantly resembles a known protein ($E>10^{-4}$). Among the ten most plant-like transcripts from the infected G. max library, 3 had no significant matches, 4 matched putative A. thaliana proteins, and 3 matched known G. max proteins: cytochrome P450 (accession AF022460, $E=9 \times 10^{-35}$), methylglyoxalase (accession P46417, $E=8 \times 10^{-35}$), and a ripening related protein (accession AF127110, $E=4 \times 10^{-71}$). Therefore, the majority (7 of 10, or 70%) of the most plant-like transcripts in the infected soybean library strongly resemble characterized plant sequences.

Figure 2.3: Paired dissimilarity test results for infected G. max (green) and axenic P. sojae mycelia and zoospore sequences (blue and cyan, respectively), compared with plant and Stramenopile plus P. infestans EST training sequences. The identity function indicates equal dissimilarity to both training sets, D(A)-D(B)=t=0.
\begin{figure}\begin{center}
\leavevmode
\epsfig{file=origin/figures/Fig3.eps,height=7in}\end{center}\end{figure}

Figure 2.4 shows that calibration curves from comparing plant and microbial symbiont training sets have good separation and minimal overlap (about 10%) in two of three cases. The exception is the training set composed of Zygomycetes and Chytridiomycetes, which overlaps considerably with plants (Figure 2.4B); the associated error rates are $\alpha=0.126$ and $\beta=0.207$. When comparing plants with bacteria, the error rates are $\alpha=0.052$ and $\beta=0.084$, much lower than when comparing plants (Medicago) with fungi (Zygomycetes and Chytridiomycetes). Error rates for comparing Stramenopiles and P. infestans ESTs with plants are as in Figure 2.2.

Figure 2.4: M. truncatula and microbial symbiont comparisons. Calibration curves compare plant training sets (solid black lines) with one of three microbial symbiont training sets (broken black lines): (A) Stramenopile and P. infestans EST sequences, (B) pooled Zygomycete and Chytridiomycete coding sequences, and (C) sequences from the genera Rhizobium, Sinorhizobium, and Bradyrhizobium. Cumulative distributions of test results from M. truncatula axenic and microbial symbiont mixed cultures appear in each panel (colored lines).
\begin{figure}\begin{center}
\leavevmode
\epsfig{file=origin/figures/Fig4.eps,height=6.5in}\end{center}\end{figure}

Also shown in Figure 2.4 are cumulative distributions from comparisons with M. truncatula and microbial symbionts. All resemble calibration curves from plant sequences, having similar medians and slightly less variance than the plant calibration curves. Comparison curves show that the great majority of test sequences are more plant-like than otherwise, with 20% or less resembling microbial symbionts more closely than plants. A greater proportion of microbial sequences is present in the M. truncatula-G. versiforme interaction library (20%, Figure 2.4B) than in the P. medicaginis-infected M. truncatula library (5%, Figure 2.4A). Contrary to expectations, the negative control test set from S. Long's root-hair enriched library (MtRHE) [34] had a greater proportion of putative microbial sequences (7% and 25%) than any of the libraries isolated from symbiont-associated cultures. The axenic and nodulating root libraries had the smallest proportion of putative microbial transcripts (< 2%, Figure 2.4C), with the axenic library closely resembling the nodulating root libraries. The method of preparing a library can affect the proportion of plant and non-plant sequences, as discussed later.

Paired dissimilarity values in Figure 2.5 show in greater detail which sequences are more or less like plant and symbiont. Sequences from an interaction library and axenic negative controls appear together for comparison. Considerable variation in the degree of dissimilarity to both training sets is clear, largely due to variation in the length of sequences within test sets. Consistent with the cumulative distributions of D(A)-D(B) in Figure 2.4, most sequences lie above the diagonal and resemble the plant host more closely than the microbial symbiont. Mycorrhizal test sequences are more difficult to differentiate than sequences from the rhizobacterial or pathogenic associations, as seen by the diminished variation about the diagonal in mycorrhizal comparisons (Figure 2.5B), contrasted with comparisons from pathogen-infected and nodulating root libraries (Figures 2.5A and 2.5C, respectively).

Figure 2.5: Paired M. truncatula and microbial symbiont comparisons. Each point indicates the dissimilarity of a test sequence with a plant training set and one of three microbial symbiont training sets: (A) Stramenopile and P. infestans EST sequences, (B) pooled Zygomycete and Chytridiomycete coding sequences, and (C) sequences from the genera Rhizobium, Sinorhizobium, and Bradyrhizobium. Sequences from M. truncatula axenic (green) and microbial symbiont mixed culture libraries are represented in each panel. The color of a point indicates the library from which the transcript was sequenced. The identity function (y=x) is also shown.
\begin{figure}\begin{center}
\leavevmode
\epsfig{file=origin/figures/Fig5.eps,height=6.5in}\end{center}\end{figure}


Peter T. Hraber 2001-06-13