
GC Content and Common Hexamers

Figure 4.6 summarizes GC content in training sets, among validation sequences from Glomus and Medicago spp., and in test libraries. In the training sets, plant sequences have lower GC content than fungal sequences, which in turn have lower GC content than rhizobacterial sequences. The same ordering holds for validation sequences, though the separation is not as great as in the training sets. Curiously, this pattern is reversed for axenic plant and fungal test sequences. The Mt Long library has GC content comparable to that of validation sequences sampled from M. truncatula, intermediate between the GC content of the plant and fungal training sets. Libraries from Glomus intraradices have about 10% lower GC content than the fungal training set, and more closely resemble the plant training set. Overlap in GC content with the fungal training set is greatest for the Lammers library, smaller for the Harrison library, and absent for the Sawaki library.

Figure 4.6: GC content of training sets, validation sequences from Glomus and Medicago spp., and test libraries. Error bars indicate standard deviation of the mean.
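As an illustration of the quantity plotted in Figure 4.6, the following is a minimal sketch of how GC content might be summarized for a set of sequences. It is not the code used for this analysis. It assumes that GC content is computed per sequence as the percentage of G and C among unambiguous bases, and that the spread is the sample standard deviation across sequences; the function names gc_content and summarize_gc are hypothetical.

```python
from statistics import mean, stdev


def gc_content(seq):
    """Percent G+C among the unambiguous bases (A, C, G, T) of one sequence.

    Assumption: ambiguous symbols such as N are ignored, not counted as AT.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * gc / (gc + at)


def summarize_gc(seqs):
    """Return (mean, sample standard deviation) of per-sequence GC content."""
    values = [gc_content(s) for s in seqs]
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread
```

Calling summarize_gc on each training set, validation set, and test library would give one mean and one spread per group, which is the kind of summary a bar chart with error bars displays.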

If we compare GC content (Figure 4.6) with abundant hexamers (Tables 4.4 and 4.5), we note that the two measures of composition are related within a set of sequences. Test sequences contain AT-rich common hexamers, and G and C residues appear among the common hexamers of the Long plant library more often than among those of the fungal libraries (Table 4.5). This pattern is the opposite of what was seen in training sets (Table 4.4), where common plant hexamers are dominated by T residues, and common fungal hexamers are less biased in composition.
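The relationship between the two measures can be checked directly from the hexamers listed in Table 4.4. The sketch below computes the fraction of G and C residues among the ten most common hexamers of each training set. It is an unweighted tally that ignores each hexamer's abundance, offered only as an illustration; the names TRAINING_HEXAMERS and hexamer_gc_percent are hypothetical.

```python
# Ten most common hexamers per training set, copied from Table 4.4.
TRAINING_HEXAMERS = {
    "fungi": ["CAAGAA", "TGGTAT", "TCAAGG", "GTCAAG", "TCAAGA",
              "CAAGGA", "CTGGTA", "ATCAAG", "GCTGGT", "CGTCAA"],
    "plants": ["TTTTTT", "TATTTT", "TTTTAT", "TTATTT", "ATTTTT",
               "TGTTTT", "TTTGTT", "TTTATT", "TTGTTT", "TTAATT"],
    "rhizobacteria": ["CGCCGC", "GGCGGC", "GCGGCG", "GCCGGC", "CGGCGC",
                      "CGGCGA", "TCGCCG", "TCGGCG", "GCGCCG", "CCGGCG"],
}


def hexamer_gc_percent(hexamers):
    """Percent of G and C residues pooled over a list of hexamers."""
    residues = "".join(hexamers)
    gc = residues.count("G") + residues.count("C")
    return 100.0 * gc / len(residues)


if __name__ == "__main__":
    for taxon, hexamers in TRAINING_HEXAMERS.items():
        print(f"{taxon}: {hexamer_gc_percent(hexamers):.1f}% GC")
```

For the training sets, this tally orders the taxa the same way as their overall GC content: plants lowest, fungi intermediate, rhizobacteria highest. The same function can be applied to the library columns of Table 4.5.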


Table 4.4: Common hexamers among training sequences. For each taxon, the ten most abundant hexamers in the training set are shown by rank, with the percentage of all hexamers that each represents.
RANK   FUNGI             PLANTS            RHIZOBACTERIA
       %      HEXAMER    %      HEXAMER    %      HEXAMER
1      0.158  CAAGAA     0.191  TTTTTT     0.143  CGCCGC
2      0.152  TGGTAT     0.177  TATTTT     0.142  GGCGGC
3      0.151  TCAAGG     0.168  TTTTAT     0.142  GCGGCG
4      0.148  GTCAAG     0.166  TTATTT     0.142  GCCGGC
5      0.147  TCAAGA     0.162  ATTTTT     0.137  CGGCGC
6      0.146  CAAGGA     0.161  TGTTTT     0.133  CGGCGA
7      0.128  CTGGTA     0.159  TTTGTT     0.130  TCGCCG
8      0.127  ATCAAG     0.157  TTTATT     0.129  TCGGCG
9      0.126  GCTGGT     0.155  TTGTTT     0.127  GCGCCG
10     0.126  CGTCAA     0.154  TTAATT     0.127  CCGGCG


Table 4.5: Ten most common hexamers among test sequence libraries. Shown for each library are the ten most abundant hexamers (by rank), with percentage of all hexamers represented by that hexamer.
RANK   MT LONG           GI HARRISON       GI LAMMERS        GI SAWAKI
       %      HEXAMER    %      HEXAMER    %      HEXAMER    %      HEXAMER
1      0.173  GAAGAA     0.247  AAAGAA     0.225  TTATTA     0.263  AAGAAA
2      0.149  AAGAAG     0.230  AAAAGA     0.221  TTTATT     0.246  AAAGAA
3      0.143  AAGAAA     0.215  AAAAAT     0.221  ATTATT     0.227  GAAAAA
4      0.126  TTCTTC     0.212  AAGAAA     0.212  ATTTTT     0.223  AAAAAT
5      0.122  TGAAGA     0.207  AAAGAT     0.192  TATTAT     0.222  AAAAGA
6      0.122  TTGTTG     0.204  CAAAAA     0.183  TTATTT     0.215  AATAAA
7      0.120  AAAGAA     0.202  AAAAAG     0.183  TATTTT     0.205  AAAATT
8      0.116  AGAAGA     0.195  TGATGA     0.177  AATTTT     0.200  AAATTA
9      0.115  TGATGA     0.191  AAAATT     0.172  TTTTAT     0.197  ATTATT
10     0.113  TTTGTT     0.189  ACAAAA     0.172  AATAAT     0.195  TTTATT
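Rankings like those in Tables 4.4 and 4.5 could be produced along the following lines. This is a minimal sketch, not the procedure actually used here. It assumes that hexamers are counted in overlapping windows on the given strand only and that windows containing ambiguous bases are skipped; the function names hexamer_percentages and top_hexamers are hypothetical.

```python
from collections import Counter


def hexamer_percentages(seqs, k=6):
    """Map each k-mer to its percentage of all k-mers counted in seqs.

    Assumptions: overlapping windows, one strand only, and windows with
    any symbol other than A, C, G, or T are skipped.
    """
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= set("ACGT"):
                counts[word] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: 100.0 * n / total for word, n in counts.items()}


def top_hexamers(seqs, n=10):
    """Return the n most abundant hexamers as (hexamer, percent) pairs.

    Ties are broken alphabetically so the ranking is reproducible.
    """
    pct = hexamer_percentages(seqs)
    return sorted(pct.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
```

With 4096 possible hexamers, a uniform composition would give each hexamer about 0.024% of the total, which offers a rough baseline for reading the percentages in the tables.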


Peter T. Hraber 2001-06-13