Beginning with raw sequences from GenBank, all libraries were filtered and screened for quality. The procedure followed that described in Section 2.3.1, but with slightly different parameters. Filtering consisted of trimming N-rich, poly-A, and poly-T regions longer than 6 nt. This was done because hexamer dissimilarity calculations can be influenced by an excess of a single hexamer, although the likelihood-ratio test statistic is sensitive to rare events (cf. Section 2.3.1, [40], and [128]). To accommodate possible sequence chimeras, sequences found to contain an internal poly-A or poly-T segment longer than 6 nt were partitioned into two fragments, and the longer of the two fragments was analyzed, provided its length was at least 100 nt.
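The filtering rules above can be illustrated with a minimal Python sketch. It is not the code used in this work, and it rests on several assumptions: the names `filter_sequence`, `MAX_RUN`, and `MIN_LENGTH` are hypothetical; "N-rich" is approximated as an uninterrupted run of N (the precise definition is given in Section 2.3.1); trimming is applied to the sequence ends; and only the first internal poly-A or poly-T run is used as the split point.

```python
import re

MAX_RUN = 6        # runs longer than this are trimmed or split on
MIN_LENGTH = 100   # shortest sequence retained for analysis

# Assumption: "N-rich" is approximated as an uninterrupted run of N.
_TERMINAL = "(?:A{%d,}|T{%d,}|N{%d,})" % ((MAX_RUN + 1,) * 3)
_LEADING = re.compile("^" + _TERMINAL + "+")
_TRAILING = re.compile(_TERMINAL + "+$")
_INTERNAL = re.compile("A{%d,}|T{%d,}" % ((MAX_RUN + 1,) * 2))


def filter_sequence(seq):
    """Return the filtered sequence, or None if it is too short."""
    seq = seq.upper()
    # Trim long poly-A, poly-T, and poly-N runs from both ends.
    seq = _LEADING.sub("", seq)
    seq = _TRAILING.sub("", seq)
    # Possible chimera: split at the first internal poly-A/poly-T run
    # (assumption: later runs are ignored) and keep the longer fragment.
    match = _INTERNAL.search(seq)
    if match:
        left, right = seq[:match.start()], seq[match.end():]
        seq = left if len(left) >= len(right) else right
    return seq if len(seq) >= MIN_LENGTH else None
```

Under these assumptions, a 120-nt sequence followed by eight T residues and a further 40 nt is reduced to its first 120 nt, while a sequence that is shorter than 100 nt after trimming is discarded (the function returns `None`).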
Screening consisted of running BLASTN searches [5] to identify any ribosomal DNA or E. coli sequences. Sequences so identified were withheld from analysis. The program tRNAscan-SE [75] (version 1.21, available at ftp.genetics.wustl.edu/pub/eddy/software/tRNAscan-SE.tar.Z) was used to verify that no transfer RNA sequences were present. Prior to filtering and screening for quality, instances of polylinker were identified with BLASTN [5] and removed from transcripts in the Harrison library (M. Harrison, personal communication).
Only transcripts having a length of 100 nt or greater after filtering were analyzed. The resulting sets of sequences are available electronically as supplementary material, which can be accessed at www.santafe.edu/~pth/diss.