I extracted the sets of human V-gene sequences from IMGT, the
international ImMunoGeneTics database (1997). In cases
where multiple alleles were given for a certain locus, I only
considered the first one in the database. I used the CDR/FR assignment
given in the IMGT alignments. As I was interested only in the
properties of the germline genes, I restricted the analysis to the
FR1, FR2, FR3, CDR1, and CDR2 fragments. CDR3 is not entirely encoded
by germline genes but contains some non-templated nucleotides, and I
therefore left it out of these calculations. As shown in Fig.
, all human V region sequences, heavy as well as light
chains, have higher average replacement mutability of CDR nucleotides
than of FR nucleotides.
To investigate which compositional biases are responsible for the difference in FR/CDR mutability, I constructed a number of variant sets for each of the germline sequences: sets of variants with identical nucleotide composition, identical codon composition, or identical amino acid sequence. For illustration, let us focus on one initial data set, the set of human VH sequences. For each sequence in this set, the set of sequences with identical FR/CDR nucleotide composition can be obtained by permuting the nucleotides in FR and in CDR separately. I constructed 105 such variants for each of the germline VH sequences. If the mutability of a sequence were completely determined by the relative proportions of its nucleotides, then permuting the positions of the nucleotides would not affect its mutability. As I will show below, this is not the case. However, for some sequences, the average FR mutability of the set of variants is already lower than the average CDR mutability. Nucleotide composition therefore does play a role in the differential mutability of FRs and CDRs, although it does not completely explain it.
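The nucleotide permutation described above can be sketched as follows. This is a minimal illustration, not the original implementation: the function name `nucleotide_variant` and the per-position label string (`F` for FR, `C` for CDR) are hypothetical, and the sketch assumes that all FR positions are pooled together and all CDR positions are pooled together before shuffling.

```python
import random


def nucleotide_variant(seq, labels, rng):
    """Return a variant of `seq` with nucleotides shuffled within each region class.

    seq    : nucleotide string (ungapped)
    labels : string of the same length; labels[i] is the region class of
             position i (hypothetical encoding: 'F' = FR, 'C' = CDR)
    rng    : random.Random instance, so that runs can be reproduced
    """
    if len(seq) != len(labels):
        raise ValueError("seq and labels must have the same length")
    out = list(seq)
    # sorted() keeps the order of classes deterministic for a given seed
    for label in sorted(set(labels)):
        idx = [i for i, l in enumerate(labels) if l == label]
        pool = [seq[i] for i in idx]
        rng.shuffle(pool)  # permute only the nucleotides of this class
        for i, base in zip(idx, pool):
            out[i] = base
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)
    toy_seq = "ACGTACGTTTGA"      # toy sequence, not a real V gene
    toy_labels = "FFFFCCCCFFFF"   # toy FR/CDR assignment
    print(nucleotide_variant(toy_seq, toy_labels, rng))
```

Calling the function repeatedly with the same generator yields a set of variants that all share the FR and CDR nucleotide composition of the starting sequence.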
Similarly, I constructed variants of a sequence that have the same codon frequencies in FR and CDR by permuting the codons separately in the two regions. Using this data set, I investigated whether the mutability of the sequence is completely determined by its codon composition. If this is the case, the mutability of the germline sequence will not be significantly different from the average over its set of variants. If, on the other hand, the linkage of codons also plays a role, the mutability of the germline sequence will be significantly different from the average over its set of variants. Whether codon composition is sufficient to explain the CDR-FR mutability difference is also a matter of intense debate. My study is the first to address this question appropriately, again because it constructs variants of the sequence whose mutability can be calculated.
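The codon permutation can be sketched in the same way. The function name `codon_variant` and the per-codon label string are hypothetical, and the sketch assumes that the sequence is in frame, ungapped, and that FR/CDR boundaries fall on codon boundaries.

```python
import random


def codon_variant(seq, codon_labels, rng):
    """Return a variant of `seq` with whole codons shuffled within each region class.

    seq          : in-frame nucleotide string, length 3 * len(codon_labels)
    codon_labels : one label per codon (hypothetical encoding: 'F' = FR, 'C' = CDR)
    rng          : random.Random instance, so that runs can be reproduced
    """
    if len(seq) != 3 * len(codon_labels):
        raise ValueError("seq must contain exactly one codon per label")
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    out = list(codons)
    for label in sorted(set(codon_labels)):
        idx = [i for i, l in enumerate(codon_labels) if l == label]
        pool = [codons[i] for i in idx]
        rng.shuffle(pool)  # codons stay intact; only their order changes
        for i, codon in zip(idx, pool):
            out[i] = codon
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)
    toy_seq = "ATGGCTTCTTGGGAACAG"   # six toy codons, not a real V gene
    toy_labels = "FFCCFF"            # toy FR/CDR assignment per codon
    print(codon_variant(toy_seq, toy_labels, rng))
```

Because codons are moved as units, each variant keeps the codon frequencies of the FR and CDR classes but changes which codons are adjacent, which is the linkage the comparison is meant to probe.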
Finally, to obtain variants of a sequence with the same translation, I first determined the amino acid sequence encoded by the germline gene. I then generated new nucleotide sequences encoding the same amino acid sequence as follows. For each amino acid, I chose, with uniform probability, one of the codons that could encode it and appended it to the nucleotide sequence. I repeated this process for all amino acids in the protein sequence. For each of the germline sequences, I constructed a set of 105 such variants, which I call translationally neutral (that is, variants with the same amino acid translation). I then used this set to test whether the mutability of the sequence is optimized through codon bias.
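The construction of a translationally neutral variant can be sketched as below, assuming the standard genetic code and an in-frame, ungapped, upper-case DNA sequence. The names `translate` and `translationally_neutral_variant` are hypothetical and are used only for illustration.

```python
import random

# Standard genetic code, built in TCAG order ('*' marks a stop codon).
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in _BASES for b in _BASES for c in _BASES), _AMINO)
)

# Map each amino acid to the list of codons that encode it.
SYNONYMS = {}
for _codon, _aa in CODON_TABLE.items():
    SYNONYMS.setdefault(_aa, []).append(_codon)


def translate(seq):
    """Translate an in-frame DNA string into a one-letter amino acid string."""
    if len(seq) % 3:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


def translationally_neutral_variant(seq, rng):
    """Re-encode the translation of `seq`, picking each codon uniformly
    at random among the codons that encode the same amino acid."""
    return "".join(rng.choice(SYNONYMS[aa]) for aa in translate(seq))


if __name__ == "__main__":
    rng = random.Random(1)
    toy_seq = "ATGGCTTCTTGG"  # toy sequence encoding M-A-S-W
    variant = translationally_neutral_variant(toy_seq, rng)
    print(variant, translate(variant))
```

Drawing codons uniformly, as in the text, deliberately ignores the codon usage of the organism, so the variants sample the full space of sequences that encode the same protein.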
With these variant sets in hand, I proceeded to analyze the mutability of FRs and CDRs in individual V region sequences.
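One simple way to place a germline sequence relative to its variant set is sketched below. This is an assumption made for illustration, not the procedure used in the analysis: the function `compare_to_variants` is hypothetical, and the mutability values are taken as already computed by whatever mutability measure is applied to the sequences.

```python
from statistics import mean


def compare_to_variants(germline_value, variant_values):
    """Summarize where a germline mutability falls within its variant set.

    Returns the mean over the variants and the fractions of variants whose
    mutability is at most, or at least, the germline value.
    """
    n = len(variant_values)
    if n == 0:
        raise ValueError("variant_values must not be empty")
    frac_le = sum(v <= germline_value for v in variant_values) / n
    frac_ge = sum(v >= germline_value for v in variant_values) / n
    return mean(variant_values), frac_le, frac_ge


if __name__ == "__main__":
    # Toy numbers only; they are not results from the study.
    print(compare_to_variants(0.5, [0.1, 0.2, 0.6, 0.5]))
```

A germline value for which one of the two fractions is very small lies in a tail of the variant distribution, which is the situation described in the text as a significant difference from the variant average.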