Sequences from several species of the genus Phytophthora were correctly distinguished from plant and bacterial sequences, and three genes from Agrobacterium tumefaciens were correctly identified as bacterial. Unlike comparisons based on GC content, word counting with a threshold value of t = 0 resolves the problem clearly and with statistical rigor, because false positive and false negative rates for a set of comparisons are readily computed from cumulative distributions of dissimilarity between two training sets. For a given false positive rate, statistical power is maximal (the false negative rate is minimal) when a likelihood-ratio test statistic is used, as shown by the Neyman-Pearson lemma [47].
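The classification rule and its error rates can be illustrated with a minimal sketch. It is not the implementation used in this study: it assumes a simple model in which hexamers are scored independently, add-one smoothing of word frequencies, and a sign convention in which positive scores favor training set A. The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

K = 6  # word length (hexamers), as in the text


def hexamer_freqs(seqs, k=K, pseudocount=1.0):
    """Relative word frequencies over a training set.

    Add-one smoothing (an assumption) keeps unseen words from
    producing a log of zero.
    """
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            word = s[i:i + k]
            if set(word) <= set("ACGT"):  # skip words with ambiguous bases
                counts[word] += 1
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts.values()) + pseudocount * len(words)
    return {w: (counts[w] + pseudocount) / total for w in words}


def log_likelihood_ratio(seq, freqs_a, freqs_b, k=K):
    """Sum of log(P_a(word) / P_b(word)) over all words in seq.

    Positive values favor training set A, negative values favor B
    (an assumed sign convention).
    """
    seq = seq.upper()
    score = 0.0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in freqs_a:
            score += math.log(freqs_a[word] / freqs_b[word])
    return score


def error_rates(scores_a, scores_b, threshold=0.0):
    """Empirical error rates when calling 'A' for scores above threshold.

    scores_a and scores_b are scores of sequences whose true origin is
    A and B respectively, ideally held out from the training sets.
    """
    false_neg = sum(s <= threshold for s in scores_a) / len(scores_a)
    false_pos = sum(s > threshold for s in scores_b) / len(scores_b)
    return false_pos, false_neg
```

Evaluating `error_rates` over a range of thresholds traces out the two cumulative distributions, so the extent to which they overlap near the chosen threshold can be read off directly.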
However, several caveats warrant prudence. Transcribed sequences that encode not proteins but catalytic single-stranded RNAs, such as transfer and ribosomal RNAs [41], should be treated independently because they are more highly conserved across taxa than messenger RNAs. Also, filtering or trimming of low-complexity repeat regions, such as poly-A or poly-T tracts, is helpful because comparison results can be influenced by the abundance of a single hexamer. Early in our investigations, one set of training sequences obtained from directionally cloned P. infestans cDNAs produced results that were difficult to interpret. It eventually became clear that, because the P. infestans sequences were all single-pass reads from the 5' end of a clone generated with the T3 primer, few sequences complementary to the 3' end of the mRNA were present in the training set. As a result, the hexamer AAAAAA was common but the hexamer TTTTTT was scarce. The poly-T hexamer would be expected in abundance when sequencing reverse complements of messenger RNAs, as obtained from 3' reads generated with the T7 primer. Both poly-A and poly-T regions were present among plant training sequences. Consequently, any sequence that contained a poly-T tract tended to resemble the plant sequences. Further, because the error rates for an inference depend on the degree to which calibration curves overlap, the best results are obtained where overlap is minimal. Despite these caveats, word counting presents a viable solution to the problem.
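One way to guard against the poly-A and poly-T artifact is sketched below. It shows two possible mitigations, neither of which is claimed to be the procedure used here: masking long homopolymer tracts before counting, and counting words on both strands so that read direction cannot bias the comparison. The minimum tract length of 10 and all function names are assumptions chosen for illustration.

```python
import re
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence; other symbols pass through."""
    return seq.translate(_COMPLEMENT)[::-1]


def mask_homopolymer_tracts(seq, bases="AT", min_len=10, mask_char="N"):
    """Replace runs of at least min_len identical bases with mask_char.

    min_len=10 is an arbitrary illustrative cutoff, not a value
    taken from the study.
    """
    pattern = re.compile(
        "|".join(f"{b}{{{min_len},}}" for b in bases), re.IGNORECASE
    )
    return pattern.sub(lambda m: mask_char * len(m.group()), seq)


def strand_symmetric_counts(seq, k=6):
    """Count k-mers on both strands, skipping words with masked bases."""
    counts = Counter()
    upper = seq.upper()
    for strand in (upper, reverse_complement(upper)):
        for i in range(len(strand) - k + 1):
            word = strand[i:i + k]
            if set(word) <= set("ACGT"):
                counts[word] += 1
    return counts
```

With strand-symmetric counts, AAAAAA and TTTTTT always occur equally often in a training set, whichever primer generated the reads.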
The P. sojae and G. max library provides a clear example of contrasting hexamer and GC composition, resulting in readily diagnosed origins. Not every case is this simple. For the two species to separate clearly, they must differ in composition, and a detectable proportion of transcripts from each species must be present in the library. To be detectable, the proportion of transcripts from a particular species must exceed the error rate obtained from calibration curves.
Although these criteria hold for the infected G. max library (t < 0 for < 25% of 927 transcripts), they do not appear to hold for the M. truncatula libraries we analyzed (t < 0 for 80-99% of 890-3017 transcripts). In the P. medicaginis interaction library, we might expect the same bimodal distribution as seen with P. sojae. However, the two libraries were prepared in different ways: the P. sojae-infected library was prepared two days after infection [93], whereas the P. medicaginis-infected library was prepared ten days after infection (D. Samac, personal communication). In the former case, a susceptible plant host strain was chosen so as to maximize the number of pathogen transcripts present in the host tissue [93]. In the latter case, individual plants varied in their degree of susceptibility (D. Samac, personal communication). Further, in the former case, hypocotyl tissue was infected directly with a zoospore suspension, as this tissue is known to be very vulnerable to pathogen infection; in the latter case, ground mycelia were dissolved in sterile water and incubated, and the resulting inoculum was pipetted onto the soil surface rather than onto the plant. These differences in how tissues were cultured prior to library preparation could have produced the disparate abundance of plant transcripts, though both libraries were prepared from plant tissues infected with Phytophthora.
For mycorrhizal root libraries, we might explain the relative lack of symbiont sequences as resulting simply from a relative lack of transcripts in the host tissue. Most of the biomass in mycorrhizal roots is plant biomass [117]. We might therefore expect that most of the transcripts therein originate from the plant host. Confounding this result, the error rates in this comparison are the greatest among all the comparisons we performed, most likely because the evolutionary distance between fungi (Zygomycetes and Chytridiomycetes) and plants is the least among comparisons [120]. Also, Zygomycete protein-coding sequences are rare in GenBank, which resulted in a small training set for these fungi, and may have amplified any biases. The high false negative rate likely led to a failure to detect some symbiont transcripts.
In nodulating root libraries, we do not expect to observe an abundance of bacterial transcripts, because bacteria generally do not form poly-adenylated messenger RNAs [69], and the protocols used to extract and purify mRNAs from tissue lysate for the libraries cited in this study all relied on the presence of poly-adenylation sites.
The abundance of putative microbial symbiont transcripts among sequences from negative control axenic root libraries is the most difficult result to resolve. Error rates were greatest for comparisons between training sets from plant and pooled Zygomycete and Chytridiomycete sequences, but the predicted proportion of microbial transcripts was slightly greater in the axenic library than in mixed cultures. The 13% false positive rate (an 87% confidence level) does not completely explain why about 15% of root-hair enriched transcripts resemble Zygomycete and Chytridiomycete hexamer composition more closely than that of plants, and this discrepancy warrants further investigation. While the investigators took care to avoid contaminating plant tissue cultures, life is both promiscuous and tenacious, and microbes could conceivably have contaminated the pure plant tissue cultures. We do not think contamination is common in the axenic root-hair enriched library, as most transcripts identified more closely resemble known or hypothetical genes from plants than from other taxa. Analysis of another axenic root-hair enriched library, particularly one provided with a carbon source to identify potential contaminants, would be an informative test.
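The gap between the observed 15% and the 13% false positive rate can be put in rough quantitative terms with a standard misclassification correction (the Rogan-Gladen estimator). This is an illustrative calculation rather than an analysis performed in this study, and the false negative rate used in the example is a hypothetical placeholder because none is quoted here.

```python
def corrected_fraction(observed, false_pos, false_neg):
    """Estimate the true fraction of microbial transcripts.

    Solves observed = p * (1 - false_neg) + (1 - p) * false_pos for p
    and clamps the result to the interval [0, 1].
    """
    denom = 1.0 - false_neg - false_pos
    if denom <= 0:
        raise ValueError("error rates too large for a meaningful correction")
    p = (observed - false_pos) / denom
    return min(1.0, max(0.0, p))


# Observed and false positive rates are from the text; the false
# negative rate of 0.13 is a hypothetical value for illustration only.
estimate = corrected_fraction(observed=0.15, false_pos=0.13, false_neg=0.13)
```

Under these assumptions the corrected estimate is only a few percent, consistent with the view that contamination, if present at all, is not common.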
These provisional inferences merit experimental verification; errors in database annotation are a notorious problem [17]. The transcripts identified as most and least like plant or symbiont might also be studied in more detail as candidate participants in symbiosis. Symbiotic interactions, whether pathogenic or mutualistic, present novel challenges to both plant hosts and the biologists who study them. Computational approaches, in concert with experimental verification, can help to resolve these challenges.