Next: Methods Up: How Many Genes? Transcript Previous: Synopsis   Contents

Introduction

Genome-scale sequencing technologies enable careful enumeration of the protein-coding genes in a genome, as well as its non-coding regions [1,8,67,102,122]. Even with the advantages of a whole-genome perspective, the predicted number of genes in a genome varies considerably [10,69]. In the human genome, for instance, even with the benefit of considerable volumes of sequence data, predictions have ranged from 28,000 to 120,000 or more [36,42,46,71]. At least two sources of error complicate a genome census for protein-coding regions: gene prediction methods vary in their accuracy, and the notion of what constitutes a gene has several meanings [10,101]. The prevalence of alternatively spliced transcripts further complicates predictions of gene diversity [16,52].

In contrast to genome sequencing, techniques that isolate and sequence transcripts expressed in particular tissues or at varied developmental stages provide a more contextual view of those genes that are expressed in vivo [1,2,34,45,58,60,82,83,93]. For organisms whose genomes have not yet been, or may never be, fully sequenced and assembled, it is informative to estimate how many genes are expected in a genome [2]. Indeed, according to Lewin ([69], page 657), "The major question about eukaryotic DNA concerns the number and types of genes in a genome". This chapter describes a way to infer gene diversity from the redundancy rates observed when expressed sequence tag (EST) data are sampled at random.

Ultimately, gene expression technologies are the most specific way to quantify the abundance of transcripts, and provide a convenient way to study the dynamic response of expression levels to varied experimental conditions [22,68,74,95,114,130]. However, hybridization-based assays require advance knowledge of what transcribed sequences will be found in a sample, so that complementary probes can be synthesized that will anneal to the targeted transcripts [72,74], or so that oligonucleotides prepared from cDNAs can be fixed to a substrate [39,95,115]. These primary sequence data are almost always acquired by sequencing plasmids from randomly picked clones prepared from a cDNA library.

Because clones are selected at random, redundant transcripts are often sequenced. EST sequencing efficiency could be improved by knowing a priori how many genes are present in a library. Strategies to eliminate redundancy typically use normalization via hybridization [18,80,109]. A hybridization normalization strategy may either screen the entire library against itself and eliminate the redundant clones, or may sequence a part of the library and then hybridize common transcripts to the remainder of the library, to identify and remove the most common transcripts [2,18,109]. Either approach could reduce the diversity in a library through cross-hybridization and simple mass loss from manipulation, meaning that some transcripts will never be sequenced because they are eliminated from the library [2,18,80]. The approach described in this chapter could provide decision support during sequencing, by identifying when further sequencing will yield diminishing returns of novel transcripts per sample sequenced, and by how much.
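The idea of a diminishing return can be illustrated with a simple, well-known proxy that is not necessarily the method developed in this chapter: the Good-Turing estimate, in which the fraction of reads belonging to clusters seen exactly once approximates the probability that the next read represents a transcript not yet observed. The following is a minimal sketch under that assumption; the function name `novelty_probability` and the toy cluster labels are hypothetical.

```python
from collections import Counter


def novelty_probability(cluster_ids):
    """Good-Turing estimate of the chance that the next read is novel.

    cluster_ids: one label per sequenced read, naming the cluster
    (putative transcript) to which that read was assigned.
    Returns (number of clusters seen exactly once) / (number of reads).
    """
    if not cluster_ids:
        raise ValueError("at least one read is required")
    counts = Counter(cluster_ids)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(cluster_ids)


# Hypothetical toy data: 8 reads falling into 4 clusters.
reads = ["a", "b", "a", "c", "a", "d", "b", "a"]
estimate = novelty_probability(reads)
print(f"estimated chance that the next read is novel: {estimate:.2f}")
```

As sequencing proceeds and fewer clusters remain singletons, this estimate falls, which is one way to quantify when additional reads are unlikely to reveal new transcripts.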

How many species are in a community, given that not every individual in the population has been sampled? Species accumulation curves, also called collector's curves, summarize the number of distinct species observed as a function of sample size [23,32,112]. In the limit of a very large number of samples, an accumulation curve asymptotically approaches the total diversity in the collection. This general approach has been used to infer diversity via sampling from unknown abundance distributions in a variety of situations, including the diversity of species in the fossil record [79,96], species richness in ecological communities [64,76], and the size of an author's vocabulary [23].
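An accumulation curve can be computed directly from the order in which samples were collected. The sketch below is illustrative only; the function name `accumulation_curve` and the toy labels are assumptions, and each label stands for whatever unit of diversity is being counted (a species, or a transcript cluster).

```python
def accumulation_curve(labels):
    """Return the running count of distinct labels after each sample.

    labels: sequence of hashable labels in sampling order.
    The i-th entry of the result is the number of distinct labels
    among the first i + 1 samples.
    """
    seen = set()
    curve = []
    for label in labels:
        seen.add(label)
        curve.append(len(seen))
    return curve


# Hypothetical toy data: 8 samples drawn from 4 distinct classes.
samples = ["a", "b", "a", "c", "a", "d", "b", "a"]
print(accumulation_curve(samples))  # [1, 2, 2, 3, 3, 4, 4, 4]
```

The curve rises quickly while most samples are new and flattens as redundant samples accumulate. Because a single curve depends on sampling order, it is common to average curves over many random orderings of the same samples.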

Sampling gene diversity in a cDNA library is a similar process. A species, or quasispecies, at the molecular level is analogous to a gene, or a gene family, depending on stringency. A community is analogous to a library, which represents a profile of the transcript levels in the tissues (or cell types) from which it was prepared. Figure 3.1 shows accumulation curves for transcripts in the MHAM mycorrhizal M. truncatula root library, introduced in Section 2.3.3, at three levels of thresholding. (Thresholding is described in Section 3.3.3, below.)

Figure 3.1: Accumulation curves for transcripts from the MHAM mycorrhizal root library. The number of distinct transcripts is plotted as a function of the number of samples sequenced. Different threshold identity scores were applied to the same data, yielding distinct accumulation curves.

A variety of estimators, both parametric and non-parametric, have been developed and tested for use by ecologists. A comprehensive review of estimators, and comparison with alternative inferential methods, is given in [32].

Relating total gene diversity at the genome level to tissue-specific, expressed diversity is similar to the problem of quantifying global species diversity from measurements of local diversity in ecological communities. There, diversity is understood to have at least two components: the total number of species in a large habitat, or gamma diversity, and the number of species in a local community, or alpha diversity [100,129]. The two are related by multiplying average local diversity by the average degree of turnover between habitats, or the beta diversity, a dimensionless parameter [100,129]. This approach might be used to infer the number of genes in a genome (as a lower bound) given sequences from a large number of tissue-specific libraries.
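The multiplicative relationship described above can be written compactly. The symbols below are chosen here for illustration: \gamma for total diversity, \bar{\alpha} for mean local diversity, and \beta for turnover.

```latex
\[
  \gamma = \bar{\alpha} \times \beta
\]
```

As a purely hypothetical numerical illustration, if tissue-specific libraries each contained on average \bar{\alpha} = 2000 distinct transcripts and the turnover among libraries were \beta = 5, the implied total would be \gamma = 2000 \times 5 = 10000 genes.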


Peter T. Hraber 2001-06-13