Next: Data Quality Up: Training Sequences Previous: Training Sequences   Contents


Calibration

To characterize hexamer frequencies in plant hosts and their microbial symbionts, we collected sets of training sequences from public databases and edited them for quality. Training sets were chosen to be representative of, but obtained independently from, taxa participating in symbiotic associations for which a diagnosis of origin would be made. Because the species being compared are represented unevenly in public sequence databases, taxa were chosen so that roughly the same number of genes were analyzed in each training set, rather than simply to maximize the numbers of species or sequences present.

Training sets represent protein-coding sequences from three taxonomic groupings: plants (Medicago and Glycine), either fungi (Zygomycetes and Chytridiomycetes) or Stramenopiles (including ESTs from Phytophthora infestans [66]), and bacteria (Rhizobium, Sinorhizobium, and Bradyrhizobium). We performed pairwise comparisons with two different, taxon-specific training sets to infer the origin of a transcript.

Training sets were obtained by querying the GenBank database using the Entrez retrieval tool [13,126,127]. A preliminary query by taxon name retrieved all available nucleotide sequences from that taxon. The Limits option was then used to exclude ESTs, STSs (sequence-tagged sites), GSSs (genome survey sequences), working draft sequences, and patented sequences from the query set. Organellar (mitochondrial and chloroplast) DNA was also excluded via the Limits option. A query term requiring that a sequence contain a protein-coding region (CDS) was also added, which excluded ribosomal and transfer RNA sequences. The results consisted of all sequences containing a nuclear protein-coding sequence available for that taxon at the time of the query. Queries were run on two separate occasions, in April and October 2000. (The slight changes in training set composition between those dates did not notably affect the experimental outcome.)
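The filtering steps above can be approximated as a single Entrez query string. The following is a minimal sketch, not the query actually used: the function name build_training_query is hypothetical, and the filter terms are present-day Entrez nucleotide filters assumed to correspond to the Limits options described in the text (in particular, mapping "working draft sequences" to the HTG division is an assumption).

```python
# Present-day Entrez filter terms ASSUMED to match the Limits options in the text.
EXCLUDED_FILTERS = [
    "gbdiv_est[PROP]",              # ESTs
    "gbdiv_sts[PROP]",              # sequence-tagged sites
    "gbdiv_gss[PROP]",              # genome survey sequences
    "gbdiv_htg[PROP]",              # working draft sequences (assumed: HTG division)
    "gbdiv_pat[PROP]",              # patented sequences
    "gene_in_mitochondrion[PROP]",  # mitochondrial DNA
    "gene_in_chloroplast[PROP]",    # chloroplast DNA
]


def build_training_query(taxon):
    """Build a query for sequences from `taxon` that carry a CDS feature,
    minus the excluded divisions and organellar records."""
    excluded = " OR ".join(EXCLUDED_FILTERS)
    return f"{taxon}[Organism] AND cds[Feature key] NOT ({excluded})"


if __name__ == "__main__":
    # One query per taxon named in the training sets.
    for name in ("Medicago", "Glycine", "Sinorhizobium"):
        print(build_training_query(name))
```

The sketch only assembles the query text; submitting it and downloading the records is left out, since the retrieval mechanics are not described here.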

Following a previously established protocol [128], we used a resampling procedure to evaluate the degree of overlap between distributions of hexamer composition obtained from comparing two training sets. In this protocol, we resampled each training set forty times by randomly partitioning it into a training pool (for hexamer counts) and a test pool (for calculating the test statistic). To control for any bias introduced by length variation, a program randomly clipped 300-nucleotide (nt) fragments for word counting. As a result, one random 300 nt fragment from each training sequence was present in the training set during a single resampling replicate; independent replicates contained different, randomly chosen training sequences and 300 nt fragments. Values of the test statistic from the forty resampled replicates were pooled together for calibration purposes.
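The resampling step can be sketched as follows. This is an illustration under stated assumptions rather than the program actually used: the function names are hypothetical, the even split between training and test pools is assumed (the text does not give the proportion), hexamers are counted in overlapping windows (also assumed), and sequences shorter than 300 nt are simply skipped. The test statistic itself is not defined in this section and is left out.

```python
import random
from collections import Counter

FRAGMENT_LENGTH = 300  # nt, as stated in the text
N_REPLICATES = 40      # as stated in the text
WORD_SIZE = 6          # hexamers


def clip_fragment(seq, rng, length=FRAGMENT_LENGTH):
    """Return one randomly placed fragment of `length` nt, or None if seq is too short."""
    if len(seq) < length:
        return None
    start = rng.randrange(len(seq) - length + 1)
    return seq[start:start + length]


def clip_pool(pool, rng):
    """Clip one random fragment from every sequence in a pool, skipping short ones."""
    fragments = (clip_fragment(seq, rng) for seq in pool)
    return [frag for frag in fragments if frag is not None]


def count_hexamers(fragments, k=WORD_SIZE):
    """Count k-mers over all fragments (overlapping windows: an assumption)."""
    counts = Counter()
    for frag in fragments:
        for i in range(len(frag) - k + 1):
            counts[frag[i:i + k]] += 1
    return counts


def resample_once(sequences, rng, train_fraction=0.5):
    """One replicate: random partition, then one random fragment per sequence.
    train_fraction=0.5 is an assumption, not a value from the text."""
    shuffled = list(sequences)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    train = clip_pool(shuffled[:n_train], rng)
    test = clip_pool(shuffled[n_train:], rng)
    return count_hexamers(train), test


def run_replicates(sequences, n=N_REPLICATES, seed=0):
    """Return a list of (training hexamer counts, test fragments), one per replicate."""
    rng = random.Random(seed)
    return [resample_once(sequences, rng) for _ in range(n)]
```

Each replicate yields hexamer counts from the training pool and a list of 300 nt test fragments on which a test statistic could then be computed and pooled across replicates.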

As in the original protocol [128], we pooled the resulting test statistic distributions, normalized them as cumulative distributions, and then evaluated them for overlap. We call the resulting comparisons calibration curves, as they are not used directly to make inferences, but rather indirectly as a means to evaluate the degree of separation in hexamer counts from different taxa. Overlap of calibration curves should be minimal to yield the most statistically powerful results possible.
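To make the idea of a calibration curve concrete, the sketch below normalizes two pooled sets of test statistic values as empirical cumulative distributions and reports the largest vertical gap between them. The function names are hypothetical, and the maximum-gap measure (the Kolmogorov-Smirnov distance) is swapped in here as one simple way to quantify separation; the text does not state how overlap was actually evaluated.

```python
from bisect import bisect_right


def ecdf(values):
    """Return the empirical cumulative distribution function of `values`."""
    xs = sorted(values)
    n = len(xs)

    def cumulative(x):
        # Fraction of observations less than or equal to x.
        return bisect_right(xs, x) / n

    return cumulative


def max_separation(a, b):
    """Largest vertical gap between the two empirical cumulative curves.
    1.0 means the curves do not overlap at all; 0.0 means they coincide."""
    curve_a, curve_b = ecdf(a), ecdf(b)
    points = sorted(set(a) | set(b))
    return max(abs(curve_a(x) - curve_b(x)) for x in points)


if __name__ == "__main__":
    # Toy values for illustration only, not results from the training sets.
    print(max_separation([1, 2, 3], [4, 5, 6]))  # 1.0
    print(max_separation([1, 2, 3], [1, 2, 3]))  # 0.0
```

A value near 1.0 corresponds to well-separated curves, the situation the text describes as desirable for statistical power.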

Due to considerable overlap of calibration curves between taxonomically general, inclusive training sets (i.e., all eudicots, all fungi and miscellaneous eukaryotes, and all eubacteria, not shown), we opted to work with specific training sets that included only the most species-specific sequences available, while maintaining approximately equal sample sizes across taxa.

The most challenging case was that of the arbuscular mycorrhizal fungi, for which very few protein-coding sequences are available. To increase the amount of data in this training set without biasing sample sizes, we pooled sequences from all species in the Zygomycetes with all available Chytridiomycete coding sequences, and compared this training set with a set from a single plant genus, Medicago. We chose this option, rather than including an arbitrary subset of sequences from the Ascomycetes and Basidiomycetes, because the Zygomycetes and Chytridiomycetes appear to have diverged from the common fungal ancestor earlier than the Ascomycetes and Basidiomycetes, based on 18S ribosomal RNA sequence data [120]. That is, the Ascomycetes and Basidiomycetes are more highly derived from the common fungal ancestor than the Zygomycetes and Chytridiomycetes, whose modern lineages more closely resemble the ancestral state [120].


Peter T. Hraber 2001-06-13