Diversity estimators were used to infer alpha diversity in a library.
For each cDNA library, we constructed an identity matrix, a square
matrix that summarizes the similarity of each pair of ESTs. For each
pair of sequences, we calculated the raw similarity score using BLASTN, version 2.0.6 [5]. To minimize the instances
of a spurious match between two sequences that share a conserved
functional domain or short repeat, a cutoff expect value of
was used for all comparisons between libraries.
The TBLASTX algorithm could compare sequence pairs as amino acids rather than nucleotides, but a nucleotide comparison is more appropriate in this application for two reasons. First, silent nucleotide substitutions in different genes could yield related amino acid translations, producing spurious positive matches [48,70]. Second, TBLASTX is less likely to insert a single-nucleotide gap; such gaps can disrupt the reading frame for an amino acid translation, yet they occur frequently in EST sequencing projects [108].
Appendix B discusses the consequences of using a lower cutoff expect value and similarity searching with TBLASTX.
Normalizing the similarity score for a match against the score from matching a sequence with itself yields identity values from 0 to 100%. Percent identity values indicate the relative similarity of transcripts but do not distinguish whether two transcripts are duplicates. Thresholded identity values provide this distinction. Thresholding remaps identity values to one if the value is above the threshold, indicating the two transcripts are duplicates, and to zero if below, to indicate two distinct transcripts. A recent study used a similar approach to identify redundant transcripts among ESTs [66]. Figure 3.3 illustrates an identity matrix for sequences from the MHAM mycorrhizal root library, before and after thresholding.
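The normalization and thresholding steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in this study: the function names, the toy score matrix, and the threshold of 80% are hypothetical. The sketch also assumes that each raw score is normalized by the self-match score of the query sequence (the row), and that a value exactly equal to the threshold is treated as distinct.

```python
import numpy as np


def percent_identity(raw_scores):
    """Normalize raw pairwise scores to percent identity.

    Assumption: entry (i, j) is divided by the self-match score of
    sequence i, so the diagonal becomes 100%.
    """
    raw = np.asarray(raw_scores, dtype=float)
    self_scores = np.diag(raw)              # score of each sequence against itself
    return 100.0 * raw / self_scores[:, None]


def threshold_identity(identity, threshold):
    """Remap identity values to 1 (duplicate) or 0 (distinct).

    Assumption: only values strictly above the threshold count as duplicates.
    """
    return (np.asarray(identity) > threshold).astype(int)


# Hypothetical raw similarity scores for three ESTs (not real data).
raw = [
    [200, 180, 20],
    [180, 190, 10],
    [20, 10, 150],
]

identity = percent_identity(raw)
duplicates = threshold_identity(identity, threshold=80.0)  # hypothetical threshold
print(np.round(identity, 1))
print(duplicates)
```

In this toy example the first two sequences are flagged as duplicates of each other, while the third remains distinct. Normalizing by the row's self-match score can make the percent identity matrix slightly asymmetric; a symmetric variant (for example, dividing by the smaller of the two self-match scores) would be an equally reasonable choice.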