For a sequence isolated from interacting symbionts, determining its cellular role (or roles) is complicated by not knowing which species expressed the sequence [93]. We refer to this challenge as the problem: given a sequence x expressed in an interaction between species A and B, did x originate from A or B? Various solutions are readily conceived, each with merits and faults. In this contribution, we demonstrate that a comparative lexical analysis of word counts (specifically, hexamer frequencies), previously used to detect library contamination in sequencing projects [128], provides a powerful computational basis to infer a transcript's species of origin.
Experimentally, one can attempt to solve the problem by hybridizing a clone (as probe) to genomic DNA (target) from both species and determining to which target the probe hybridizes. This approach can produce very reliable results. However, if a sequence is highly conserved in the two taxa, hybridization stringency conditions can influence the outcome considerably. For high-throughput EST sequence analysis, source verification by hybridization is impractical in terms of time and reagents. As an alternative to in vitro hybridization, several computational solutions are possible.
Were the genome sequences of both species completely determined, one could simply use sequence similarity searching [4,5,111]. However, most plant hosts and their microbial symbionts have little or no genomic sequence data available, which makes this approach very unreliable: strong similarity to a sequence from one organism does not preclude the possibility that a similar sequence is present in the other species. Conclusions based upon such partial knowledge have been informative, but potentially misleading [21,93].
Codon usage bias varies across taxa [50,70]. Exploiting this fact may seem a viable solution to the problem, as it has proven suitable for predicting the presence of introns among exons in genomic DNA. In practice, however, it requires knowing the reading frame in which a messenger RNA is translated into an amino acid sequence. EST data are of notoriously unreliable quality, sometimes having a large proportion of ambiguous bases, and sometimes having single base-pair insertions or deletions, which disrupt a reading frame. Word counting is less prone to these sources of error. It also captures the information intrinsic to codon usage bias, because hexamers counted in a sliding window include codon pairs, whereas codons are read in non-overlapping, tiled windows.
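The contrast between sliding-window words and tiled codons can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function names `count_words` and `count_codons` are hypothetical, and skipping any window that contains an ambiguous base is an assumed policy for handling low-quality EST reads.

```python
from collections import Counter

def count_words(seq, k=6):
    """Count overlapping k-mers with a sliding window (step of 1 base).

    Assumption: windows containing any character other than A, C, G, T
    (e.g. the ambiguity code N) are skipped rather than resolved.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):
            counts[word] += 1
    return counts

def count_codons(seq, frame=0):
    """Count non-overlapping codons (tiled windows, step of 3 bases).

    The result depends on `frame`, which is why a single inserted or
    deleted base changes every codon downstream of the error.
    """
    seq = seq.upper()
    return Counter(seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
```

Because every offset is visited, `count_words` needs no reading frame, and a single ambiguous base or indel affects at most k windows instead of the remainder of the read.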
An intuitive approach to the problem that examines sequence composition is to compare the guanine and cytosine (GC) base content of a sequence with that of other sequences from the species being studied. When two species' genomes differ in GC content, this method can be very useful. In a recent investigation, for instance, Phytophthora sojae and Glycine max sequences showed a 20% difference in mean GC composition [93]. The origin of a number of sequences could readily be identified this way, but a large proportion could not be, due to considerable overlap in the tails of the two distributions. Counting frequencies of GC is a simple form of word counting, where the word size k is 1/2 (only two semi-words, G/C and A/T, are counted).
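As a point of reference, GC content reduces to counting two classes of base. The sketch below is illustrative only. The name `gc_fraction` is hypothetical, and excluding ambiguous bases from the denominator is an assumption, not a choice stated in the text.

```python
def gc_fraction(seq):
    """Return the fraction of unambiguous bases that are G or C.

    Assumption: characters other than A, C, G, T are ignored, so a
    read with many N calls is scored only on its resolved bases.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        return float("nan")  # no unambiguous bases to score
    return gc / total
```

A sequence would then be assigned by asking which species' GC distribution its score falls under, which is exactly where overlapping tails leave many sequences unassigned.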
An alternative approach to determine the origin of a sequence is suggested by previous work on word counts, or k-tuple frequencies, which was intended as a means to evaluate a library for contamination when sequencing from a single model organism [128]. The word-counting method provides distinct advantages over other computational methods. Unlike sequence similarity searching, it does not require the complete genomes of the two interacting species to make reasonable inferences. Further, word counting subsumes both GC composition and codon frequencies. That is, the underlying differences between the two organisms that result in base composition or codon usage biases can also be detected by counting words. Dunning's likelihood-ratio test of word dissimilarities [40] also has the appealing property of being non-parametric, having no assumption of normality for the underlying frequency distribution, which makes it statistically powerful [47]. Dunning demonstrated that unreliable results can be obtained from parametric tests, such as the chi-squared test, particularly in such cases as lexical analysis [40].
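For readers unfamiliar with the likelihood-ratio statistic, the sketch below computes it for a single word observed in two collections of words, using the standard 2x2 contingency-table form G² = 2 Σ O ln(O/E). It is offered under stated assumptions: the function name `g2` and its argument names are hypothetical, and this section does not specify how per-word statistics are combined across hexamers, so no aggregation step is shown.

```python
import math

def g2(k1, n1, k2, n2):
    """Log-likelihood ratio statistic for one word in two corpora.

    k1, k2: occurrences of the word in corpus 1 and corpus 2.
    n1, n2: total words counted in corpus 1 and corpus 2.
    The null hypothesis is that the word has the same relative
    frequency in both corpora.
    """
    if n1 <= 0 or n2 <= 0:
        raise ValueError("both corpora must contain at least one word")
    observed = [k1, n1 - k1, k2, n2 - k2]
    p = (k1 + k2) / (n1 + n2)  # pooled frequency under the null
    expected = [n1 * p, n1 * (1 - p), n2 * p, n2 * (1 - p)]
    # Cells with zero observations contribute nothing (0 * ln 0 -> 0).
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)
```

The statistic is zero when the word is equally frequent in both corpora and grows as the frequencies diverge. It remains well defined for rare words with very small counts, the situation in which tests that assume normality tend to be unreliable.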
In the experiments detailed below, we first validate the word-counting method on sequences whose origin and function are known, then compare its ability to diagnose the origin of sequences obtained from symbiotic interactions with that of GC base composition distributions. We examine sequences from pathogenic interactions between species of the genus Phytophthora and the plant hosts Glycine max and Medicago truncatula, then apply the word-counting approach to sequences from two microbial mutualists in association with M. truncatula: the arbuscular mycorrhizal Zygomycete Glomus versiforme and the nitrogen-fixing bacterium Sinorhizobium meliloti.