We have applied diversity estimators, originally developed to predict species diversity in ecological communities, to predict the diversity of transcripts in tissue-specific cDNA libraries. Although we think the estimators are valid in cases where the true degree of diversity is known, the predictions derived here are likely to overestimate the diversity of expressed transcripts. The predicted diversity of M. truncatula root transcripts exceeds the total number of genes in Arabidopsis thaliana [8], so the predictions for transcript diversity in plant roots might best be interpreted as an upper bound. How can we account for these apparent overestimates of transcript diversity?
Expressed sequence tags vary in both length and quality. Because of how libraries are prepared, it is not uncommon for multiple fragments of the same mRNA to be present in a library. These fragments do not necessarily overlap, in which case a single mRNA yields two distinct tags. The method used in this study to group transcripts into quasispecies cannot recognize two non-overlapping fragments (or even partially overlapping fragments) as derived from the same mRNA.
Of the sequences analyzed, a very small proportion were obtained from sequencing reactions using the T7 primer, which hybridizes to the sequencing vector at the 3' end of an insert, rather than the 5' end. This proportion was greatest in the KV0 and KV3 libraries, in which 13 of 2491 (0.5%) and 11 of 2173 (0.7%), respectively, were sequenced with the T7 primer. Because no assembly of the resulting cDNA fragments was attempted, it is likely that separate fragments, isolated from the same gene transcript, are not grouped with their 5' counterparts. This would result in a slight overestimate of transcript diversity, which could be corrected either by removing those sequences obtained from the 3' end of a transcript, or possibly by performing a sequence assembly prior to analysis, where complementary reads from the same clone are known.
Alternate splice variants, transcripts derived from the same gene but edited after transcription to produce different proteins, would result in greater transcript diversity than gene diversity. The prevalence of splice variants among human transcripts has only recently been appreciated [16,52]. The importance of splice variants in Medicago and other plant species is unknown [8]. Thus, the presence of alternate splice variants could have caused our diversity estimates of transcripts to exceed those of genes, though to an unknown extent. Should the genome of M. truncatula be sequenced, the relation of these predictions to diversity estimates from a genome census for protein-coding regions would be interesting and informative.
The degree of stringency required to identify two sequences as redundant, and also the sensitivity of the means used to detect two related transcripts, will have a considerable effect on resulting diversity estimates. The technique used here is quite conservative. A more sensitive, less stringent method would likely result in lower diversity estimates.
In general, the utility of this approach as decision support for when to normalize a library whose sequencing is in progress depends on the cost of sequencing a sample, balanced against the benefit of discovering a previously unseen transcript quasispecies. As the slope of an accumulation curve approaches zero, the cost of discovering a new type of transcript continues to increase. Indeed, the slope of an accumulation curve can be viewed as a per-sample discovery rate: a slope of 1/96 implies a yield of one new transcript per 96 samples sequenced. The exact point of diminishing returns (yield per sampling effort) depends on the sequencing cost and the value placed upon discovering a new type of transcript.
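As a rough illustration of this trade-off, the sketch below estimates the slope of an accumulation curve by finite differences and reports the first sampling point at which the expected cost of discovering one new quasispecies exceeds the value assigned to it. The curve values, the cost per sample, and the value per new transcript are hypothetical numbers chosen for illustration, not data from this study, and the function names are our own.

```python
from fractions import Fraction

def discovery_rates(samples, observed):
    """Finite-difference slope of an accumulation curve.

    samples  -- cumulative number of clones sequenced at each checkpoint
    observed -- cumulative number of distinct quasispecies seen at each checkpoint
    Returns one rate (new types per sample) per interval between checkpoints.
    """
    rates = []
    for i in range(1, len(samples)):
        d_samples = samples[i] - samples[i - 1]
        d_types = observed[i] - observed[i - 1]
        rates.append(Fraction(d_types, d_samples))
    return rates

def stopping_point(samples, observed, cost_per_sample, value_per_new_type):
    """First checkpoint at which the cost per new type exceeds its assigned value.

    Returns None if the discovery rate never falls that low.
    """
    for end, rate in zip(samples[1:], discovery_rates(samples, observed)):
        # A rate of zero means no new types were found in the interval at all.
        if rate == 0 or cost_per_sample / rate > value_per_new_type:
            return end
    return None

# Hypothetical accumulation curve, sampled in batches of 96 clones.
samples = [0, 96, 192, 288, 384]
observed = [0, 60, 90, 100, 101]

rates = discovery_rates(samples, observed)
# The last interval yields one new type per 96 samples, i.e. a slope of 1/96.
print(rates[-1])

# With an assumed cost of 1 unit per sample and a value of 50 units per new type,
# sequencing stops paying for itself in the final batch.
print(stopping_point(samples, observed, cost_per_sample=1, value_per_new_type=50))
```

Under these assumed numbers the slope in the final batch is 1/96, so each new quasispecies costs 96 units to discover. Whether that is acceptable depends entirely on the value assigned to a new transcript.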
We were able to quantify the extent of overlap in types of transcripts obtained from different libraries. However, the degree of complementarity observed between libraries does not agree with our intuitions. Why don't axenic and nodulating root libraries cluster together? It is possible that the two libraries were prepared from different, distantly related, strains. However, it seems unlikely that this alone would account for the differences seen.
One possible cause of divergence is a significant proportion of bases misidentified by the base-calling algorithm used to read the chromatograms. The quality of a sequence read from a chromatogram trace varies from base to base and from one run to another [1], and we do not have complete per-base quality scores for any of the sequences we analyzed.
Why are the DSIR and MHAM libraries more similar in composition than libraries independently derived from similar plant tissues? Do both libraries contain some defense-response genes that are up-regulated during both pathogen infection and colonization by a mutualist? A more detailed consideration of which transcripts are common to which libraries might help elucidate the disparity between our expectations and observations.
An unresolved issue remains: we do not know how to infer complementarity from estimated, rather than observed, diversity [32]. The complementarity values reported above were calculated from observed diversity. As more samples are sequenced, complementarity will presumably decrease. However, we cannot make this comparison without extrapolating into the unknown, and the form of such extrapolations remains to be understood [32].
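To make the dependence on observed diversity concrete, the sketch below computes complementarity between two libraries from the sets of quasispecies observed in each. It assumes complementarity is measured as the fraction of the combined quasispecies found in only one of the two libraries (one minus the Jaccard similarity); the measure actually used in this study may differ. The quasispecies labels are hypothetical.

```python
def complementarity(lib_a, lib_b):
    """Fraction of the pooled quasispecies that occur in only one library.

    Assumed definition: |A symmetric-difference B| / |A union B|.
    0.0 means identical composition; 1.0 means no shared quasispecies.
    """
    a, b = set(lib_a), set(lib_b)
    pooled = a | b
    if not pooled:
        raise ValueError("both libraries are empty")
    return len(a ^ b) / len(pooled)

# Hypothetical quasispecies observed so far in two libraries.
lib_a = {"q1", "q2", "q3", "q4"}
lib_b = {"q3", "q4", "q5"}
print(complementarity(lib_a, lib_b))

# Further sequencing of the second library reveals q1, already seen in the first.
# The observed complementarity drops even though the libraries have not changed.
lib_b_deeper = lib_b | {"q1"}
print(complementarity(lib_a, lib_b_deeper))
```

In this toy case the observed value falls from 0.6 to 0.4 once a shared quasispecies is sampled in the second library, which is the kind of sampling-depth effect that makes values computed from observed diversity hard to compare.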
Can we use this approach to estimate global transcript diversity? Theory suggests this might provide an alternative, bottom-up technique to determine how many genes are in a genome. In practice, this might prove unreliable. Calibration with a genome that has already been sequenced and assembled, and for which several different cDNA libraries have been sequenced, would verify the utility of this approach.