Next: Methods Up: On the Species of Previous: Synopsis   Contents

Introduction

Access to automated DNA sequencing technology has made possible the rapid generation and analysis of gene transcripts expressed in organisms via expressed sequence tags, or ESTs [1,2,33,102,113]. This information has helped to identify those genes expressed in particular stages of development and in specialized tissues or organs [34,58,60,82,83]. Novel gene products and target leads for therapeutic intervention can also be gleaned rapidly from ESTs [1,2,20]. A more detailed understanding of the molecular interactions between symbionts, whether pathogenic or mutualistic, is also possible with this approach [14,51,55,66,119].

For a sequence isolated from interacting symbionts, determining its cellular role (or roles) is complicated by not knowing which species expressed the sequence [93]. We state this challenge as the following problem: given a sequence x expressed in an interaction between species A and B, did x originate from A or B? Various solutions are readily conceived, each with merits and faults. In this contribution, we demonstrate that a comparative lexical analysis of word counts (specifically, hexamer frequencies), previously used to detect library contamination in sequencing projects [128], provides a powerful computational basis for inferring a transcript's species of origin.

Experimentally, one can attempt to solve the problem by hybridizing a clone (as probe) to genomic DNA (target) from both species and determining to which target the probe hybridizes. This approach can produce very reliable results. However, if a sequence is highly conserved in the two taxa, hybridization stringency conditions can influence the outcome considerably. For high-throughput EST sequence analysis, source verification by hybridization is impractical in terms of time and reagents. As an alternative to in vitro hybridization, several computational solutions are possible.

Were the genome sequences of both species completely determined, one could simply use sequence similarity searching [4,5,111]. However, most plant hosts and their microbial symbionts have little or no genomic sequence data available, which makes this approach unreliable. Strong similarity to a sequence from one organism does not preclude the possibility that a similar sequence is present in the other species. Conclusions based upon such partial knowledge have been informative, but potentially misleading [21,93].

Codon usage bias varies across taxa [50,70]. Exploiting this fact may seem a viable solution to the problem, as it has proven suitable for predicting the presence of introns among exons in genomic DNA. However, it is not practical here, because the reading frame must be known in order to translate a messenger RNA into an amino acid sequence. EST data are of notoriously unreliable quality, sometimes having a large proportion of ambiguous bases, and sometimes having single base-pair insertions or deletions, which disrupt the reading frame. Word counting is less prone to these sources of error. It also captures the information intrinsic to codon usage bias, because it counts codon pairs as hexamers in a sliding window, whereas codons are read in non-overlapping, tiled windows.
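The contrast between sliding-window word counts and tiled codon counts can be illustrated with a minimal Python sketch. The function names `count_words` and `count_codons` are hypothetical, and skipping any window that contains an ambiguous base is an assumption of this sketch rather than a rule stated in the text.

```python
from collections import Counter

def count_words(seq, k=6):
    """Count overlapping k-mers (words) in a sliding window of width k.

    Windows containing anything other than A, C, G or T are skipped;
    this handling of ambiguous bases is an assumption of the sketch.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):
            counts[word] += 1
    return counts

def count_codons(seq, frame=0):
    """Count codons in non-overlapping, tiled windows starting at `frame`."""
    seq = seq.upper()
    return Counter(seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))

# A single inserted base changes every downstream codon,
# but only the few hexamers that overlap the insertion.
original = "ATGAAATGAATGAAATGA"
inserted = "ATGCAAATGAATGAAATGA"  # extra C after position 3
print(count_codons(original))
print(count_codons(inserted))
print(count_words(original))
print(count_words(inserted))
```

Because the hexamer window advances one base at a time, no reading frame has to be chosen, and an ambiguous base affects at most six windows.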

An intuitive approach to the problem that examines sequence composition is to compare the guanine and cytosine (GC) base content of a sequence with other sequences from the species being studied. When two species' genomes have different GC content, this method can be very useful. In a recent investigation, for instance, Phytophthora sojae and Glycine max sequences showed a 20% difference in mean GC composition [93]. The origin of a number of sequences could readily be identified this way, but a large proportion could not be, due to considerable overlap in the distributions' tails. Counting GC frequencies is a simple form of word counting in which the word size k is 1/2: since there are 4^k possible words of size k, only two semi-words, G/C and A/T, are counted.
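As a minimal sketch of this comparison, the GC fraction of a single sequence can be computed as follows. The function name `gc_content` is hypothetical, and excluding ambiguous bases from the denominator is an assumption of the sketch.

```python
def gc_content(seq):
    """Return the fraction of unambiguous bases that are G or C.

    Ambiguous symbols such as N are ignored (an assumption of this
    sketch). Returns None when no unambiguous base is present.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else None

print(gc_content("ATGCGCGCTA"))
```

A sequence would then be assigned to whichever species' GC distribution it fits better, which is where overlap between the two distributions limits the method.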

An alternative approach to determine the origin of a sequence is suggested by previous work on word counts, or k-tuple frequencies, which was intended as a means to evaluate a library for contamination when sequencing from a single model organism [128]. The word-counting method provides distinct advantages over other computational methods. Unlike sequence similarity searching, there is no need to have sequenced the complete genome of the two interacting species to make reasonable inferences. Further, word counting subsumes both GC composition and codon frequencies. That is, the underlying differences between the two organisms that result in base composition or codon usage biases can also be detected by counting words. Dunning's likelihood-ratio test of word dissimilarities [40] also has the appealing property of being non-parametric, having no assumption of normality for the underlying frequency distribution, which makes it statistically powerful [47]. Dunning demonstrated that unreliable results can be obtained from parametric tests, such as χ², particularly in such cases as lexical analysis [40].
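For orientation, the general form of a log-likelihood ratio statistic for a single word can be sketched in Python. This is the standard G² computation on a 2x2 contingency table (word versus all other words, in two collections of sequences). The function name `g2` and its parameters are hypothetical, and the sketch does not claim to reproduce how the test is applied in this work.

```python
import math

def g2(a, b, n1, n2):
    """Log-likelihood ratio (G^2) for one word in two collections.

    a  : occurrences of the word in collection 1
    b  : occurrences of the word in collection 2
    n1 : total words counted in collection 1
    n2 : total words counted in collection 2
    """
    observed = [[a, b], [n1 - a, n2 - b]]
    total = n1 + n2
    row_totals = [a + b, total - a - b]
    col_totals = [n1, n2]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                e = row_totals[i] * col_totals[j] / total
                stat += o * math.log(o / e)
    return 2.0 * stat

# Equal relative frequencies give a statistic of zero;
# unequal frequencies give a positive value.
print(g2(10, 10, 100, 100))
print(g2(20, 5, 100, 100))
```

The statistic uses the observed counts directly, so it remains usable for rare words, where the expected counts are too small for the normal approximation behind χ² to hold.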

In the experiments detailed below, we first validate the word-counting method on sequences whose origin and function are known, then compare its ability to diagnose the origin of sequences obtained from symbiotic interactions with that of GC base composition. We examine sequences from pathogenic interactions between species from the genus Phytophthora and the plant hosts Glycine max and Medicago truncatula, then apply the word-counting approach to sequences from two microbial mutualists in association with M. truncatula: the arbuscular mycorrhizal Zygomycete Glomus versiforme and the nitrogen-fixing bacterium Sinorhizobium meliloti.


Peter T. Hraber 2001-06-13