Next: Validation Up: Training Sequences Previous: Calibration   Contents


Data Quality

Starting with a full set of sequences, we selected high-quality sequences by trimming N-rich regions and poly-A or poly-T tracts. The test statistic can be sensitive to the abundance of a single word [128]. We therefore trimmed poly-A and poly-T tracts to minimize cases in which a test sequence resembles one training set more closely than the other simply because it contains many copies of the hexamer AAAAAA or TTTTTT. Similarly, test results obtained from short or N-rich sequences can be difficult to interpret [128]. We allowed no more than one N per hexamer and trimmed poly-A or poly-T tracts longer than 13 nt. To accommodate possible sequence chimeras, sequences containing an internal poly-A or poly-T tract longer than 13 nt were partitioned into two fragments, and the longer fragment was used in the analysis, provided its length was at least 300 nt.
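The trimming rules above can be illustrated with a minimal Python sketch. It is not the routine used to build the training sets; the function names are hypothetical, and two points the text leaves open are filled in by assumption: a sequence with several long tracts is split at every tract and the longest fragment is kept, and N-rich hexamers are only reported, not removed.

```python
import re

MIN_LEN = 300    # minimum usable fragment length (nt), from the text
MAX_TRACT = 13   # poly-A/poly-T tracts longer than this are trimmed

# Matches runs of A, or runs of T, strictly longer than MAX_TRACT nt.
POLY_AT = re.compile("A{%d,}|T{%d,}" % (MAX_TRACT + 1, MAX_TRACT + 1))


def longest_clean_fragment(seq):
    """Split seq at long poly-A/poly-T tracts and keep the longest piece.

    Returns None when the longest piece is shorter than MIN_LEN.
    Assumption: with more than one tract, the longest fragment overall
    is kept; the text only describes the two-fragment case.
    """
    fragments = POLY_AT.split(seq.upper())
    longest = max(fragments, key=len)
    return longest if len(longest) >= MIN_LEN else None


def n_rich_hexamers(seq):
    """Return start positions of hexamers containing more than one N."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 5) if seq[i:i + 6].count("N") > 1]
```

A terminal tract is handled by the same split: it leaves an empty fragment on one side, so the remaining sequence is returned if it is long enough.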

After trimming, we screened all remaining sequences of 300 nt or longer for similarity to E. coli using BLASTN [4,5]. All BLAST searches used default parameters and low-complexity filtering with the programs DUST or SEG. The decision to exclude non-coding RNA sequences from training sets was informed by the appearance of bimodal distributions of hexamer frequencies and a large degree of overlap between calibration curves (not shown), likely a result of divergent evolutionary rates between protein-coding and non-coding sequences [48,57]. Chloroplast and mitochondrial sequences were eliminated to avoid complications due to variation in codon usage between nuclear and organellar genomes.
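The screening step can be pictured as a filter over tabular BLAST output. The Python sketch below assumes the standard 12-column tab-delimited hit table (query ID in the first column, E-value in the eleventh). The function names and the E-value cutoff are hypothetical: the text states only that default BLAST parameters were used and does not give the criterion for discarding a sequence.

```python
def queries_with_hits(tabular_text, max_evalue=1e-5):
    """Collect query IDs with at least one hit at or below max_evalue.

    max_evalue is an illustrative assumption, not a value from the text.
    """
    flagged = set()
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if float(fields[10]) <= max_evalue:  # column 11 holds the E-value
            flagged.add(fields[0])           # column 1 holds the query ID
    return flagged


def screen(records, flagged_ids, min_len=300):
    """Keep sequences that are unflagged and at least min_len nt long.

    records maps sequence ID to sequence string.
    """
    return {
        seq_id: seq
        for seq_id, seq in records.items()
        if seq_id not in flagged_ids and len(seq) >= min_len
    }
```

The same filter could be applied once per screening database (for example E. coli, organellar, or non-coding RNA sequences), with the flagged IDs combined before the final call to `screen`.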


Table 2.1: Training sets: Number of sequences (n) and nucleotides (nt), as raw, trimmed (removed N-rich regions, poly-A and poly-T sites), and screened sequences (removed ribosomal, chloroplast, and mitochondrial DNA and remaining sequences shorter than 300 nt).

TAXON                               RAW           TRIMMED          SCREENED
                           n         nt      n         nt      n         nt
Glycine                  892  1,265,829    834  1,219,114    826  1,184,951
Medicago                 401    561,104    382    519,739    380    513,868
Total, Plants                                               1206  1,698,819
Stramenopiles            199    299,113    184    287,600    181    279,900
P. infestans            2131  1,219,463   2102  1,209,113   2082  1,199,372
Total, Stramenopiles                                        2263  1,479,272
Zygomycetes              232    343,817    212    329,222    211    327,229
Chytridiomycetes          82    123,698     78    119,754     78    119,754
Total, Fungi                                                 289    446,983
Rhizobium                478  1,430,132    444  1,404,883    444  1,404,883
Sinorhizobium            320    900,294    312    898,687    312    898,687
Bradyrhizobium           153    471,309    146    465,307    146    465,307
Total, Rhizobia                                              902  2,768,877

Table 2.1 summarizes counts of sequences and nucleotides in training sets before and after trimming and screening. All training sets obtained using the procedure described above, as well as the routines used to obtain them, are available as supplementary material at www.santafe.edu/~pth/diss.
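As a consistency check on Table 2.1, the group totals can be recomputed from the per-taxon screened counts. The short Python sketch below copies the screened values from the table; the variable and function names are illustrative only.

```python
# Screened (n, nt) values per taxon, copied from Table 2.1.
screened = {
    "Plants": {"Glycine": (826, 1184951), "Medicago": (380, 513868)},
    "Stramenopiles": {
        "Stramenopiles": (181, 279900),
        "P. infestans": (2082, 1199372),
    },
    "Fungi": {"Zygomycetes": (211, 327229), "Chytridiomycetes": (78, 119754)},
    "Rhizobia": {
        "Rhizobium": (444, 1404883),
        "Sinorhizobium": (312, 898687),
        "Bradyrhizobium": (146, 465307),
    },
}

# Totals as printed in the "Total" rows of Table 2.1.
reported = {
    "Plants": (1206, 1698819),
    "Stramenopiles": (2263, 1479272),
    "Fungi": (289, 446983),
    "Rhizobia": (902, 2768877),
}


def group_totals(groups):
    """Sum sequence counts and nucleotide counts within each group."""
    return {
        name: (
            sum(n for n, _ in taxa.values()),
            sum(nt for _, nt in taxa.values()),
        )
        for name, taxa in groups.items()
    }


assert group_totals(screened) == reported
```

The assertion passes, which confirms that each "Total" row is the sum of the screened columns for its group.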


Peter T. Hraber 2001-06-13