

Empirical Diversity Inference

Diversity estimators were used to infer alpha diversity in a library. For each cDNA library, we constructed an identity matrix, a square matrix that summarizes the similarity of each pair of ESTs. For each pair of sequences, we calculated the raw similarity score using BLASTN, version 2.0.6 [5]. To minimize spurious matches between two sequences that share only a conserved functional domain or short repeat, a cutoff expect value of $E \le 10^{-10}$ was used for all comparisons between libraries.
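As a minimal sketch of the matrix construction described above, the following Python collects pairwise hits into a square score matrix and discards any hit whose expect value exceeds the cutoff. The record layout (query, subject, expect value, raw score), the function name `build_score_matrix`, and the choice to keep the best score per ordered pair are assumptions made for illustration; they are not taken from the BLASTN 2.0.6 output format or from the original analysis.

```python
from typing import Iterable, List, Tuple

E_CUTOFF = 1e-10  # expect-value cutoff stated in the text

# Hypothetical hit record: (query_id, subject_id, expect_value, raw_score)
Hit = Tuple[str, str, float, float]


def build_score_matrix(ids: List[str], hits: Iterable[Hit],
                       e_cutoff: float = E_CUTOFF) -> List[List[float]]:
    """Return a square matrix of raw scores; rows are queries, columns are subjects."""
    index = {seq_id: i for i, seq_id in enumerate(ids)}
    n = len(ids)
    scores = [[0.0] * n for _ in range(n)]
    for query, subject, expect, score in hits:
        if expect > e_cutoff:
            continue  # hit does not satisfy E <= cutoff, treat as no match
        i, j = index[query], index[subject]
        # Assumption: if a pair is reported more than once, keep the best score.
        if score > scores[i][j]:
            scores[i][j] = score
    return scores
```

Pairs with no retained hit keep a score of zero, so the diagonal holds each sequence's self score and the off-diagonal entries hold only matches that pass the cutoff.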

The TBLASTX algorithm could compare sequence pairs as amino acids rather than nucleotides, but a nucleotide comparison is more appropriate in this application for two reasons. First, silent nucleotide substitutions in different genes could yield related amino acid translations, producing spurious positive matches [48,70]. Second, TBLASTX is less likely to insert a single-nucleotide gap in an alignment; such gaps disrupt the reading frame of an amino acid translation, yet they occur frequently in EST sequencing projects [108].

Appendix B discusses the consequences of using a lower cutoff expect value and similarity searching with TBLASTX.

Normalizing the similarity score for a match against the score from matching a sequence with itself yields identity values from 0 to 100%. Percent identity values indicate the relative similarity of transcripts but do not by themselves distinguish whether two transcripts are duplicates. Thresholded identity values provide this distinction. Thresholding maps each identity value to one if it is at or above the threshold, indicating that the two transcripts are duplicates, and to zero if it is below, indicating two distinct transcripts. A recent study used a similar approach to identify redundant transcripts among ESTs [66]. Figure 3.3 illustrates an identity matrix for sequences from the MHAM mycorrhizal root library, before and after thresholding.
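The normalization and thresholding steps can be sketched as follows. This is an illustrative reading of the description, not the original implementation: the function names are hypothetical, each row is normalized by that row's own self score (the text does not say which member of a pair supplies the self score), and the default threshold of 90% is taken from the caption of Figure 3.3.

```python
from typing import List


def percent_identity(scores: List[List[float]]) -> List[List[float]]:
    """Normalize each raw score by the row sequence's self score, as a percentage."""
    n = len(scores)
    identity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        self_score = scores[i][i]
        if self_score <= 0:
            continue  # no self score available, leave the row at zero
        for j in range(n):
            identity[i][j] = 100.0 * scores[i][j] / self_score
    return identity


def threshold(identity: List[List[float]], cutoff: float = 90.0) -> List[List[int]]:
    """Map identity values to 1 (duplicates) at or above the cutoff, else 0 (distinct)."""
    return [[1 if value >= cutoff else 0 for value in row] for row in identity]


raw = [
    [100.0, 95.0, 20.0],
    [95.0, 100.0, 0.0],
    [20.0, 0.0, 50.0],
]
identity = percent_identity(raw)
duplicates = threshold(identity)
```

In this toy input, the first two sequences score 95% against one another and are marked as duplicates, while the third matches the first at only 40% of its own self score and remains distinct.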

Figure 3.3: Example identity matrix for 3017 transcripts from the MHAM mycorrhizal root library. Above the main diagonal, a point occurs where a BLASTN search indicated a pair of transcripts matched one another. To demonstrate the effect of thresholding, points below the diagonal indicate pairs of transcripts having an identity score (ratio of match score to self score) of 90% or greater.


Peter T. Hraber 2001-06-13