| ACCESSION | GENE NAME | mRNA? | L (nt) | PLANTS | STRAMEN. | BACTERIA |
| Glomus versiforme | ||||||
| AJ009628 | chitin synthase Gvchs1 | N | 638 | 2535.2 | 2468.6 | 2718.4 |
| AJ009629 | chitin synthase Gvchs2 | N | 481 | 2203.2 | 2050.0 | 2286.0 |
| AJ009630 | chitin synthase Gvchs3 | N | 4116 | 7205.9 | 5235.8 | 5985.8 |
| U38650 | phosphate transporter | Y | 1833 | 3937.9 | 5702.3 | 6514.3 |
| Glycine max | ||||||
| J01297 | actin SAc3 | N | 1620 | 3322.0 | 4554.6 | 5329.7 |
| K00821 | lectin Le1 | N | 2152 | 4124.6 | 6558.3 | 7928.3 |
| M64267 | iron superoxide dismutase | Y | 1056 | 2773.6 | 3761.2 | 4269.2 |
| Medicago truncatula | ||||||
| AF000354 | phosphate transporter MtPT1 | Y | 1920 | 3800.3 | 5630.7 | 6654.2 |
| AF000355 | phosphate transporter MtPT2 | Y | 1867 | 3673.9 | 5390.1 | 6424.0 |
| AF055921 | Mt4 genomic sequence | N | 954 | 2631.9 | 4004.4 | 4539.1 |
| AF106929 | cell wall protein AM1 | Y | 885 | 3433.6 | 4200.0 | 4774.3 |
| AF106930 | translation initiation protein AM3-1 | Y | 3154 | 4557.6 | 5982.7 | 7212.8 |
| AF106931 | translation initiation protein AM3-2 | Y | 1384 | 3371.1 | 4130.0 | 4644.4 |
| AJ132891 | ha1 gene, exons 1-22 | N | 3620 | 4383.2 | 8683.6 | 10730.7 |
| AJ388847 | MtNo213 superoxide dismutase | Y | 530 | 2110.2 | 2219.8 | 2367.0 |
| AJ388865 | MtNo233 triosephosphate isomerase | Y | 563 | 2171.6 | 2405.6 | 2618.6 |
| U16727 | peroxidase precursor rip1 | N | 2603 | 4246.1 | 8210.0 | 9901.9 |
| U38651 | sugar transporter | Y | 1728 | 3619.6 | 5128.4 | 5976.5 |
| X57732 | leghemoglobin Mtlb1 | N | 1073 | 3021.3 | 5029.1 | 5845.9 |
| X57733 | leghemoglobin Mtlb2 | N | 592 | 2045.9 | 3156.0 | 3568.2 |
| X60386 | lectin lec1 | N | 1363 | 3228.8 | 4935.4 | 5605.6 |
| X60387 | lectin lec2 | N | 1192 | 3142.8 | 4472.6 | 4985.9 |
| X82216 | lec3 | N | 1155 | 2928.4 | 4283.3 | 4930.8 |
| X68032 | ENOD12 | N | 772 | 2780.4 | 3679.7 | 4096.5 |
| X99466 | ENOD16 | N | 1142 | 3124.6 | 4535.5 | 5156.2 |
| X99467 | ENOD20 | N | 1405 | 4003.6 | 5294.7 | 5966.7 |
| Y10267 | glutamine synthetase | Y | 1413 | 3116.1 | 4506.5 | 5292.1 |
| Y10373 | chitinase | Y | 1305 | 3369.5 | 4090.4 | 4703.4 |
| Phytophthora infestans | ||||||
| AF004951 | surface glycoprotein elicitor, inf2A | Y | 648 | 3428.4 | 2421.9 | 2589.1 |
| AF004952 | surface glycoprotein elicitor, inf2B | Y | 701 | 3611.7 | 2514.5 | 2698.6 |
| L23938 | ipiO2 | N | 1556 | 4125.2 | 4339.5 | 4855.5 |
| L23939 | ipiO1 | N | 1826 | 4360.5 | 4580.9 | 5259.0 |
| L24206 | ipiB1 | N | 1726 | 6086.7 | 4584.3 | 5159.3 |
| M59715 | actin actA | N | 1736 | 5137.1 | 3637.2 | 4420.0 |
| M59716 | actin actB | N | 1405 | 4425.3 | 3569.5 | 4141.6 |
| M83535 | calmodulin calA | N | 1358 | 4063.0 | 3724.9 | 4138.1 |
| X64537 | tigA | N | 2448 | 6221.0 | 4193.8 | 5181.9 |
| P. capsici | ||||||
| U42304 | chitin synthase chs | N | 449 | 2238.8 | 1882.7 | 1997.5 |
| P. parasitica | ||||||
| X97205 | cellulose binding-elicitor lectin | Y | 918 | 3819.1 | 2876.0 | 3208.4 |
| Sinorhizobium meliloti | ||||||
| AF040724 | nodD | N | 1776 | 5317.9 | 4197.8 | 4179.4 |
| AF110770 | superoxide dismutase sodA | N | 1196 | 4898.2 | 3343.2 | 2916.0 |
| M61753 | exoD | N | 858 | 4071.8 | 2847.2 | 2372.9 |
| M68858 | nodulation protein nodP & nodQ | N | 3476 | 9992.3 | 5288.0 | 3954.7 |
| M96261 | phosphate regulators, phoU & phoB | N | 1178 | 5332.7 | 3359.3 | 2866.4 |
| U90221 | syrA | N | 1102 | 4176.2 | 3375.8 | 3220.8 |
| X01649 | nodA, nodB, & nodC | N | 3373 | 7684.1 | 4819.5 | 4646.1 |
| X03065 | regulatory nitrogen fixation fixD | N | 2111 | 5723.0 | 4249.8 | 4228.3 |
| X17523 | glutamine synthetase II | N | 990 | 4303.5 | 2959.9 | 2720.4 |
| Y08500 | putA | N | 3804 | 13623.3 | 6376.5 | 4212.0 |
| Agrobacterium tumefaciens | ||||||
| U91632 | sugar transporter gguA & membrane-spanning permeases gguB & gguC | N | 4185 | 11132.6 | 5959.4 | 4551.4 |
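One way to read the table is to assign each sequence to the training set with the smallest value. The sketch below does this for four rows copied from the table. It assumes the three rightmost columns are dissimilarities to the plant, Stramenopile, and bacterial training sets, and that smaller means more similar; the function name `closest_training_set` is illustrative only.

```python
# Rows copied from the table above: accession -> (plants, stramenopiles, bacteria).
# Assumption: these columns are dissimilarity values, smaller = more similar.
ROWS = {
    "AJ009628": (2535.2, 2468.6, 2718.4),  # G. versiforme chitin synthase Gvchs1
    "J01297": (3322.0, 4554.6, 5329.7),    # G. max actin SAc3
    "AF004951": (3428.4, 2421.9, 2589.1),  # P. infestans elicitor inf2A
    "AF040724": (5317.9, 4197.8, 4179.4),  # S. meliloti nodD
}
LABELS = ("plants", "stramenopiles", "bacteria")


def closest_training_set(values):
    """Return the label of the training set with the smallest dissimilarity."""
    best = min(range(len(values)), key=lambda i: values[i])
    return LABELS[best]


assignments = {acc: closest_training_set(v) for acc, v in ROWS.items()}
for acc, label in assignments.items():
    print(acc, label)
```

Under these assumptions the soybean actin falls with plants, the P. infestans elicitor with Stramenopiles, and nodD with bacteria. The fungal chitin synthase Gvchs1 has no fungal column in this table and lands nearest the Stramenopile set by a narrow margin.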
Table 2.4 summarizes the percent GC content and the ten most abundant hexamers for the training sequences. Plants and fungi (pooled Zygomycetes and Chytridiomycetes) both have mean GC content below 50%, with means differing by less than 3 percentage points, and both have large standard deviations (5-8%). Sequences from Stramenopiles (pooled with P. infestans ESTs) and from rhizobacteria have mean GC content above 50%. Thus, fungi and plants are the least distinguishable on the basis of GC content, whereas Stramenopiles and rhizobacteria are more readily distinguished from plants. Differences in GC content between taxa are also reflected in the most common hexamers. Poly-A and poly-T hexamers are clearly abundant among plants and fungi, but less so among Stramenopiles and rhizobacteria (Table 2.4).
| | PLANTS | | STRAMENOPILES | | FUNGI | | RHIZOBIA | |
| | mean | s | mean | s | mean | s | mean | s |
| % GC | 40.7 | 5.18 | 55.8 | 4.24 | 43.6 | 8.47 | 60.0 | 3.57 |
| RANK | ||||||||
| 1 | 0.250 | AAAAAA | 0.126 | AAGAAG | 0.257 | AAAAAA | 0.159 | GCCGGC |
| 2 | 0.182 | TTTTTT | 0.113 | CAAGGA | 0.184 | TTTTTT | 0.159 | CGCCGC |
| 3 | 0.118 | TTTGTT | 0.111 | CAAGAA | 0.114 | CAAGAA | 0.157 | GGCGGC |
| 4 | 0.117 | ATTTTT | 0.108 | AAAAAA | 0.110 | GGTGGT | 0.157 | GCGGCG |
| 5 | 0.117 | TATTTT | 0.105 | GCTGCT | 0.109 | AATAAA | 0.156 | CGGCGC |
| 6 | 0.111 | TTGTTT | 0.098 | CTGCTG | 0.108 | AAGAAG | 0.150 | CGGCGA |
| 7 | 0.110 | ATATAT | 0.096 | AAGGAG | 0.106 | AAAGAA | 0.147 | TCGGCG |
| 8 | 0.109 | TTATTT | 0.096 | GCCAAG | 0.104 | AAAAAT | 0.147 | TCGCCG |
| 9 | 0.109 | GAAGAA | 0.094 | TGCTGC | 0.104 | AAATAA | 0.141 | GCGCCG |
| 10 | 0.108 | AAAAAT | 0.092 | GAGGAG | 0.100 | TGGTGG | 0.141 | CCGGCG |
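The two quantities in Table 2.4 can be sketched as follows. This is a minimal illustration, not the procedure used to build the table: it assumes overlapping hexamer windows, percent frequencies over all counted hexamers, and that words containing ambiguous bases are skipped. The function names and toy sequences are made up.

```python
from collections import Counter


def gc_percent(seq):
    """Percent of G and C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


def top_hexamers(seqs, n=10, k=6):
    """Return the n most common overlapping k-mers with their percent frequency."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= set("ACGT"):  # skip words with N or other codes
                counts[word] += 1
    total = sum(counts.values())
    return [(word, 100.0 * c / total) for word, c in counts.most_common(n)]


toy = ["AAAAAAAATTTTTTGC", "GCCGGCGCCGGC"]  # made-up sequences, not thesis data
print([gc_percent(s) for s in toy])
print(top_hexamers(toy, n=3))
```

Pooling counts over all sequences before ranking, as done here, weights long sequences more heavily than short ones; averaging per-sequence frequencies would be an equally plausible alternative.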
Distributions of GC content are approximately normal in two of the three cases studied, those from axenic P. sojae cultures (Figure 2.1). For sequences from infected plant cultures, a bimodal distribution is apparent. Roughly 25% of the 927 sequences from infected G. max contain less than 50% GC; most of these are likely plant transcripts [93]. This proportion is considerably greater than in axenic P. sojae cultures, in which fewer than 5% of sequences from mycelia and zoospores contain less than 50% GC.
Several properties of cumulative distribution functions warrant
comment, to help explain similar plots from word dissimilarity
comparisons (Figures 2.1B and
2.2A). The median
of a distribution occurs where the function reaches a cumulative
probability of 0.5. Medians from all three P. sojae libraries
are similar, varying by less than 4% GC
(Figure 2.1B). Other properties of the distributions
are also readily apparent; for example, the steeper the slope at the
median, the smaller the variance. A useful property of cumulative
distribution functions is that the y value at any point gives the
integrated area (cumulative probability) under the density curve up to
that x value. We use this property to test for significant differences
in dissimilarity
(Figure 2.2A). In this case,
and
.
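The properties of cumulative distribution functions described above can be illustrated with an empirical CDF. The sketch below is a minimal example on made-up percent GC values (not data from the P. sojae libraries): it reads off the median where the cumulative probability reaches 0.5 and the cumulative probability below a chosen cutoff.

```python
def ecdf(values):
    """Return the sorted values and their cumulative probabilities (i/n)."""
    xs = sorted(values)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]


def median_from_ecdf(values):
    """Smallest x at which the empirical CDF reaches 0.5."""
    xs, ps = ecdf(values)
    for x, p in zip(xs, ps):
        if p >= 0.5:
            return x
    raise ValueError("empty sample")


def fraction_below(values, threshold):
    """Share of values strictly below threshold (the CDF just left of it)."""
    return sum(v < threshold for v in values) / len(values)


gc_values = [38.0, 42.5, 47.0, 55.0, 57.5, 58.0, 60.5, 61.0]  # made-up % GC
print(median_from_ecdf(gc_values))      # x where the curve reaches 0.5
print(fraction_below(gc_values, 50.0))  # cumulative probability below 50% GC
```

Note that this definition returns the lower of the two middle values for an even-sized sample, which differs slightly from the conventional averaged median.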
Calibration curves from hexamer dissimilarity tests, shown in
Figure 2.2B as solid black lines for plant and dashed
black lines for Stramenopile training sequences, are approximately
normal. The medians differ considerably, with only about 10%
overlap in the two distributions' tails about the neutral t value of
0. Superimposed are comparison curves from P. sojae test
sets (Figure 2.2B), which parallel the GC composition
curves in Figure 2.1B but show slightly less variance.
Axenic sequences are clearly more like Oomycetes than plants in
hexamer composition, with all but a small percentage having positive
t values. Plant-like sequences make up about 23% of the mixed library,
similar to the proportion detected by GC composition. As expected, the two
methods agree, having positively correlated values for GC and t
(
,
,
).
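Agreement between the two methods can be checked by correlating per-sequence GC content with the t value. The sketch below assumes a Pearson product-moment coefficient, which may not be the statistic used in the text, and the paired values are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up pairs: percent GC and t value for the same sequences.
gc = [40.0, 45.0, 55.0, 60.0]
t = [-2.0, -1.0, 1.5, 2.5]
r = pearson_r(gc, t)
print(round(r, 3))
```

A positive coefficient here means that sequences with higher GC content also tend to have higher t values, which is the direction of agreement the text describes.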
Looking in more detail at the paired dissimilarity values
(Figure 2.3), we can
see which individual sequences are more or less like plant and
pathogen. The magnitudes of dissimilarity are also apparent, with
longer sequences having larger dissimilarity values. BLASTX
similarity searches against the protein sequences in nr (from
ftp.ncbi.nlm.nih.gov/blast/db) revealed that none of the 12
plant-like mycelial transcripts significantly resemble known proteins
(
). Among the top ten most plant-like transcripts from the
infected G. max library, 3 had no significant matches, 4
matched putative A. thaliana proteins, and 3 matched known
G. max proteins: cytochrome P450 (accession AF022460,
), methylglyoxalase (accession P46417,
), and a ripening related protein (accession AF127110,
). Therefore, the majority (7 of 10, or 70%) of the most plant-like
transcripts in the infected soybean library significantly resemble
known or putative plant proteins.
Figure 2.4 shows
that calibration curves from comparing plant and microbial symbiont
training sets have good separation and minimal overlap (about 10%) in
two of three cases, but not for the training set composed of
Zygomycetes and Chytridiomycetes, which overlaps considerably with
plants (Figure 2.4B). The associated error rates
are
and
. When comparing between plants
and bacteria, the error rates are
and
,
much lower than when comparing plants (Medicago) with fungi
(Zygomycetes and Chytridiomycetes). Error rates for comparing
Stramenopiles and P. infestans ESTs with plants are as in
Figure 2.2.
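Error rates of this kind can be estimated from the overlap of two calibration distributions about the neutral value of 0. The sketch below is one plausible reading, not the thesis's exact procedure: it assumes that negative t is called plant-like and positive t microbe-like (consistent with axenic P. sojae sequences having mostly positive t values), and the calibration values are made up.

```python
def error_rates(plant_t, microbe_t, cutoff=0.0):
    """Return (plant error, microbe error) as misassigned fractions.

    Assumed convention: t < cutoff is called plant-like and t > cutoff
    microbe-like. Values exactly at the cutoff count as errors for both sets.
    """
    plant_err = sum(t >= cutoff for t in plant_t) / len(plant_t)
    microbe_err = sum(t <= cutoff for t in microbe_t) / len(microbe_t)
    return plant_err, microbe_err


# Made-up calibration values for the two training sets.
plant_t = [-3.1, -2.0, -1.2, -0.4, 0.3]
microbe_t = [-0.5, 0.8, 1.9, 2.7, 3.3]
print(error_rates(plant_t, microbe_t))
```

Each fraction is the height at which the corresponding cumulative curve crosses the cutoff (or one minus that height), so both can be read directly from calibration plots of this kind.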
Also shown in Figure 2.4 are cumulative distributions from comparisons of test sequences with M. truncatula and microbial symbionts. All resemble the plant calibration curves, with similar medians and slightly less variance. Comparison curves show that the great majority of test sequences are more plant-like than otherwise, with 20% or less resembling microbial symbionts more closely than plants. A greater proportion of microbial sequences is present in the M. truncatula-G. versiforme interaction library (20%, Figure 2.4B) than in the P. medicaginis-infected M. truncatula library (5%, Figure 2.4A). Contrary to expectations, the negative control test set from S. Long's root-hair enriched library (MtRHE) [34] had a greater proportion of putative microbial sequences (7% and 25%) than any of the libraries isolated from symbiont-associated cultures. The axenic and nodulating root libraries had the smallest proportion of putative microbial transcripts (< 2%, Figure 2.4C), with the axenic library closely resembling the nodulating root libraries. The method of preparing a library can affect the proportion of plant and non-plant sequences, as discussed later.
Paired dissimilarity values in Figure 2.5 show in greater detail which sequences are more or less like plant and symbiont. Sequences from an interaction library and axenic negative controls appear together for comparison. Considerable variation in the degree of dissimilarity to both training sets is clear, largely due to variation in the length of sequences within test sets. Consistent with the cumulative distributions of D(A)-D(B) in Figure 2.4, most sequences lie above the diagonal and resemble the plant host more closely than the microbial symbiont. Mycorrhizal test sequences are more difficult to differentiate than sequences from the rhizobacterial or pathogenic associations, as seen by the diminished variation about the diagonal in mycorrhizal comparisons (Figure 2.5B), contrasted with comparisons from pathogen-infected and nodulating root libraries (Figures 2.5A and 2.5C, respectively).