Next: Hexamer Dissimilarity Comparisons Up: Data Analysis Previous: Data Analysis   Contents


Training Sets


Table 4.2: Training sets: number of sequences (n) and nucleotides (nt) used to calculate hexamer dissimilarity. Data are summarized before (raw) and after (filtered) trimming to remove poly-A and poly-T regions, and screening to remove ribosomal RNA sequences.
TAXON                         RAW                 FILTERED
                          n          nt        n          nt
Chytridiomycota          28      52,737       28      50,198
Zygomycota              345     461,423      335     402,834
TOTAL, FUNGI                                 363     453,032

Daucus                   20      98,071       20      64,907
Glycine                 411     773,236      411     533,121
Medicago                123     251,882      115     156,884
TOTAL, PLANTS                                546     754,912

Bradyrhizobium          165     948,036      165     943,954
Rhizobium               608   2,339,473      607   2,260,533
Sinorhizobium           336     965,446      335     963,287
TOTAL, RHIZOBACTERIA                       1,107   4,167,774

To characterize hexamer frequencies in plant hosts and their microbial symbionts, we collected sets of training sequences from GenBank and edited them for quality. Using methods described in Section 2.3.5, we queried GenBank to gather training sets of protein-coding sequences from three taxa: ($A$) fungi (Zygomycetes and Chytridiomycetes), ($B_1$) plants (Daucus, Medicago, and Glycine spp.), and ($B_2$) rhizobacteria (Rhizobium, Sinorhizobium, and Bradyrhizobium spp.). We imposed one further constraint here: the molecule isolated for sequencing had to be DNA rather than mRNA, to minimize instances of the hexamers AAAAAA and TTTTTT. For the eukaryotic taxa, using DNA sequences to the exclusion of mRNAs implies that non-coding introns are present in the training sets.
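The concern about AAAAAA and TTTTTT can be illustrated with a small sketch. It assumes hexamers are counted in overlapping windows that advance one nucleotide at a time, which this section does not state; the function name `hexamer_counts` and the example sequence are hypothetical. Under that assumption, a homopolymer run of length L contributes L - 5 copies of the same hexamer, so a single poly-A stretch, such as an mRNA-derived sequence might carry, can dominate the counts.

```python
from collections import Counter


def hexamer_counts(seq, k=6):
    """Count overlapping k-mers (assumed window step of 1 nt)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Hypothetical 12-nt fragment, with and without a 30-nt poly-A stretch.
fragment = "ATGGCCGTTCAG"
with_tail = fragment + "A" * 30

without_tail_counts = hexamer_counts(fragment)
with_tail_counts = hexamer_counts(with_tail)

# A run of 30 A's holds 30 - 6 + 1 = 25 overlapping AAAAAA windows.
print(without_tail_counts["AAAAAA"], with_tail_counts["AAAAAA"])
```

In this toy case the poly-A stretch accounts for 25 of the 37 hexamer windows in the sequence, which is the kind of distortion the DNA-only constraint is meant to limit.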

Training sets were subjected to the filtering and screening procedures described above: N-rich regions were excluded, poly-A and poly-T stretches were trimmed to no longer than 8 nt, non-protein-coding organellar and ribosomal DNA sequences were excluded, and each final sequence was required to be at least 100 nt in length. The resulting training sets are available as supplementary material and are summarized in Table 4.2.
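The trimming and length criteria above can be sketched as follows. This is a minimal illustration, not the procedure actually used: the function names and the `MAX_N_FRACTION` threshold are hypothetical, "trimmed to no longer than 8 nt" is read as shortening each longer run to 8 nt, and whole N-rich sequences are dropped as a simplification of excluding N-rich regions. Screening for organellar and ribosomal sequences is omitted, since it requires comparison against reference sequences.

```python
import re

MAX_RUN = 8           # from the text: poly-A/poly-T stretches no longer than 8 nt
MIN_LENGTH = 100      # from the text: final sequence at least 100 nt
MAX_N_FRACTION = 0.1  # assumption: the text does not define "N-rich"


def trim_poly_at(seq, max_run=MAX_RUN):
    """Shorten every run of A or T longer than max_run to exactly max_run."""
    pattern = re.compile(r"A{%d,}|T{%d,}" % (max_run + 1, max_run + 1))
    return pattern.sub(lambda m: m.group(0)[:max_run], seq)


def filter_sequence(seq):
    """Return the filtered sequence, or None if it fails a criterion."""
    seq = seq.upper()
    if not seq:
        return None
    # Simplification: reject the whole sequence if too many bases are N.
    if seq.count("N") / len(seq) > MAX_N_FRACTION:
        return None
    seq = trim_poly_at(seq)
    # Apply the length cutoff after trimming.
    return seq if len(seq) >= MIN_LENGTH else None


# Hypothetical input: 100 nt of sequence followed by a 20-nt poly-A run.
example = "ACGT" * 25 + "A" * 20
filtered = filter_sequence(example)
print(len(example), len(filtered))
```

Applying the length cutoff after trimming follows the text's wording that the final sequence be at least 100 nt; a sequence that is long only because of a homopolymer run would therefore be rejected.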


Peter T. Hraber 2001-06-13