| TAXON | RAW n | RAW nt | FILTERED n | FILTERED nt |
|---|---|---|---|---|
| Chytridiomycota | 28 | 52,737 | 28 | 50,198 |
| Zygomycota | 345 | 461,423 | 335 | 402,834 |
| TOTAL, FUNGI | | | 363 | 453,032 |
| Daucus | 20 | 98,071 | 20 | 64,907 |
| Glycine | 411 | 773,236 | 411 | 533,121 |
| Medicago | 123 | 251,882 | 115 | 156,884 |
| TOTAL, PLANTS | | | 546 | 754,912 |
| Bradyrhizobium | 165 | 948,036 | 165 | 943,954 |
| Rhizobium | 608 | 2,339,473 | 607 | 2,260,533 |
| Sinorhizobium | 336 | 965,446 | 335 | 963,287 |
| TOTAL, RHIZOBACTERIA | | | 1,107 | 4,167,774 |

To characterize hexamer frequencies in plant hosts and their microbial
symbionts, we collected sets of training sequences from GenBank and
edited them for quality. Using the methods described in Section 2.3.5,
we queried GenBank for protein-coding sequences from three groups of
taxa: (i) fungi (Zygomycetes and Chytridiomycetes), (ii) plants
(Daucus, Medicago, and Glycine spp.), and (iii) rhizobacteria
(Rhizobium, Sinorhizobium, and Bradyrhizobium spp.). We imposed one
further constraint here: the molecule isolated for sequencing had to be
DNA rather than mRNA, to minimize instances of the hexamers AAAAAA and
TTTTTT, which the poly-A tails of mRNA-derived sequences would
otherwise contribute. For the eukaryotic taxa, using DNA sequences to
the exclusion of mRNAs means that non-coding introns are present in the
training sets.
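
The effect of a poly-A tail on hexamer counts can be illustrated with a minimal sketch. It assumes hexamers are counted in overlapping windows on a single strand and that windows containing non-ACGT characters are skipped. Neither assumption is stated in the text, and the function names `count_hexamers` and `hexamer_frequencies` are hypothetical.

```python
from collections import Counter


def count_hexamers(seq, k=6):
    """Count overlapping k-mers (hexamers by default) in one sequence.

    Assumption: windows containing anything other than A, C, G or T
    (for example N) are skipped rather than counted.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):
            counts[word] += 1
    return counts


def hexamer_frequencies(seqs):
    """Pool hexamer counts over a training set and normalize to frequencies."""
    total = Counter()
    for s in seqs:
        total.update(count_hexamers(s))
    n = sum(total.values())
    return {word: c / n for word, c in total.items()} if n else {}


# A toy coding fragment, with and without a 20-nt poly-A tail appended.
genomic_like = "ATGGCCAAGCTTGGA"
mrna_like = genomic_like + "A" * 20

print(count_hexamers(genomic_like)["AAAAAA"])  # no AAAAAA windows
print(count_hexamers(mrna_like)["AAAAAA"])     # many AAAAAA windows from the tail
```

In this toy case a single tail adds more AAAAAA windows than there are windows in the entire coding fragment, which is the kind of distortion the DNA-only constraint is meant to avoid.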
Training sets were subjected to the filtering and screening procedures described above: N-rich regions were excluded; poly-A and poly-T stretches were trimmed to no longer than 8 nt; non-protein-coding organellar and ribosomal DNA sequences were excluded; and each final sequence was required to be at least 100 nt long. The resulting training sets are available as supplementary material and are summarized in Table 4.2.
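
The sequence-level filters can be sketched as follows. The 8-nt and 100-nt limits come from the text. The definition of an "N-rich" region (here, a run of at least `MIN_N_RUN` consecutive Ns, which is excised) is a hypothetical stand-in, since the text does not define it. The exclusion of organellar and ribosomal records depends on GenBank annotations and is not shown. The function name `filter_sequence` is also hypothetical.

```python
import re

MAX_HOMOPOLYMER = 8  # from the text: poly-A / poly-T trimmed to at most 8 nt
MIN_LENGTH = 100     # from the text: final sequence must be at least 100 nt
MIN_N_RUN = 5        # hypothetical threshold for calling a region "N-rich"


def filter_sequence(seq, min_n_run=MIN_N_RUN):
    """Apply the sequence-level filters; return the cleaned sequence or None.

    Assumption: an N-rich region is a run of >= min_n_run consecutive Ns,
    and it is excised from the sequence. The original procedure may differ.
    """
    seq = seq.upper()
    # Excise N-rich regions (assumed definition, see above).
    seq = re.sub("N{%d,}" % min_n_run, "", seq)
    # Shorten any poly-A or poly-T run longer than 8 nt to exactly 8 nt.
    seq = re.sub("A{%d,}" % (MAX_HOMOPOLYMER + 1), "A" * MAX_HOMOPOLYMER, seq)
    seq = re.sub("T{%d,}" % (MAX_HOMOPOLYMER + 1), "T" * MAX_HOMOPOLYMER, seq)
    # Discard sequences that end up shorter than the minimum length.
    return seq if len(seq) >= MIN_LENGTH else None


def filter_training_set(seqs):
    """Filter a list of sequences, dropping those that fail the length check."""
    cleaned = (filter_sequence(s) for s in seqs)
    return [s for s in cleaned if s is not None]
```

Under this sketch, trimming lowers the nucleotide total without changing the number of sequences, while the length check can lower both.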