Starting with a full set of sequences, we filtered for high-quality sequences by trimming N-rich and poly-A or poly-T regions. The test statistic can be sensitive to the abundance of a single word [128]. Thus, we trimmed poly-A and poly-T tracts to minimize the cases in which a test sequence resembles one training set more closely than the other simply by virtue of having an abundance of the hexamer AAAAAA or TTTTTT. Similarly, test results obtained from short or N-rich sequences can be difficult to interpret [128]. We allowed no more than one N per hexamer and trimmed poly-A or poly-T tracts longer than 13 nt. To accommodate possible sequence chimeras, sequences found to contain an internal poly-A or poly-T segment longer than 13 nt were partitioned into two fragments, and the longer of the two fragments was used in the analysis, provided its length was at least 300 nt.
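The trimming rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the routines distributed with the supplementary material: the function names are hypothetical, a sequence with several long tracts is reduced to its single longest fragment, and the N rule is read as skipping any overlapping hexamer window that contains more than one N.

```python
import re

MAX_TRACT = 13    # poly-A/poly-T tracts longer than this are removed (from the text)
MIN_LENGTH = 300  # minimum length of a usable fragment, in nt (from the text)

# Matches a run of at least 14 A's or at least 14 T's.
_LONG_TRACT = re.compile(r"A{%d,}|T{%d,}" % (MAX_TRACT + 1, MAX_TRACT + 1))


def trim_homopolymers(seq):
    """Remove long poly-A/poly-T tracts and keep the longest remaining fragment.

    A terminal tract is simply cut off; an internal tract splits the sequence,
    and only the longest piece is kept. Returns None if that piece is shorter
    than MIN_LENGTH. (Hypothetical helper; keeping one fragment when there are
    several tracts is an assumption.)
    """
    fragments = _LONG_TRACT.split(seq.upper())
    longest = max(fragments, key=len)
    return longest if len(longest) >= MIN_LENGTH else None


def valid_hexamers(seq, max_n=1):
    """Yield overlapping hexamers that contain at most max_n ambiguous bases (N).

    Overlapping windows are an assumption; the text only states the limit of
    one N per hexamer.
    """
    seq = seq.upper()
    for i in range(len(seq) - 5):
        hexamer = seq[i:i + 6]
        if hexamer.count("N") <= max_n:
            yield hexamer
```

For example, a sequence of 350 C's followed by 20 A's and 100 G's would be reduced to the 350-nt C fragment, whereas an internal run of exactly 13 A's would leave the sequence untouched.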
After trimming, we screened all remaining sequences of 300 nt or longer for similarity to E. coli using BLASTN [4,5]. All BLAST searches used default parameters and low-complexity filtering with the programs DUST or SEG. The decision to exclude non-coding RNA sequences from training sets was informed by the appearance of bimodal distributions of hexamer frequencies and a large degree of overlap between calibration curves (not shown), likely a result of divergent evolutionary rates between protein-coding and non-coding sequences [48,57]. Chloroplast and mitochondrial sequences were eliminated to avoid complications due to variation in codon usage between nuclear and organellar genomes.
| Taxon | Raw (n) | Raw (nt) | Trimmed (n) | Trimmed (nt) | Screened (n) | Screened (nt) |
|---|---|---|---|---|---|---|
| Glycine | 892 | 1,265,829 | 834 | 1,219,114 | 826 | 1,184,951 |
| Medicago | 401 | 561,104 | 382 | 519,739 | 380 | 513,868 |
| Total, Plants | | | | | 1206 | 1,698,819 |
| Stramenopiles | 199 | 299,113 | 184 | 287,600 | 181 | 279,900 |
| P. infestans | 2131 | 1,219,463 | 2102 | 1,209,113 | 2082 | 1,199,372 |
| Total, Stramenopiles | | | | | 2263 | 1,479,272 |
| Zygomycetes | 232 | 343,817 | 212 | 329,222 | 211 | 327,229 |
| Chytridiomycetes | 82 | 123,698 | 78 | 119,754 | 78 | 119,754 |
| Total, Fungi | | | | | 289 | 446,983 |
| Rhizobium | 478 | 1,430,132 | 444 | 1,404,883 | 444 | 1,404,883 |
| Sinorhizobium | 320 | 900,294 | 312 | 898,687 | 312 | 898,687 |
| Bradyrhizobium | 153 | 471,309 | 146 | 465,307 | 146 | 465,307 |
| Total, Rhizobia | | | | | 902 | 2,768,877 |
Table 2.1 summarizes counts of sequences and nucleotides in training sets before and after trimming and screening. All training sets obtained using the procedure described above, as well as the routines used to obtain them, are available as supplementary material at www.santafe.edu/~pth/diss.