

Appendix A
Sampling and Diversity Estimates

The intent is to derive an analytic expression for accumulation curves given any abundance distribution of transcripts. The expression should saturate: as the number of samples increases, the number of distinct transcripts observed should approach the total diversity in the collection. We begin by defining the sampling process that describes the relative abundance of transcripts as a sample from a larger collection (the library), then derive a general expression that relates accumulation curves to the abundance distribution of transcripts.

We will use the following notation for abundance distributions: $x$ is an abundance class. Singletons (only one individual of a type sampled) correspond to $x=1$, duplicates to $x=2$, etc. Note that $x$ may be either discrete or continuous, but $x \ge 0$. In either case, $x$ indicates a size class and some function on $x$ defines the abundance distribution. Several abundance distributions merit interest: $G(x)$ is the abundance distribution of genes in a genome; $g(x)$ is the distribution of transcripts expressed in a particular tissue type or developmental stage, or in a cDNA library; $f(x)$ is a sample of transcripts drawn from $g(x)$. The total number of transcripts is $N = \sum _x x g(x)$, or $n = \sum _x x f(x)$ in a sample; total diversity is $S_{tot} = \sum _x g(x)$, and observed diversity is $S_{obs} = \sum _x f(x)$. For convenience, we also define the sampling proportion $r = n/N$, and the proportion of sampled diversity $q = S_{obs}/S_{tot}$.
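To make the notation concrete, the following Python sketch computes these quantities for a small discrete example. The dictionaries `g` and `f` and the helper names `total_count` and `diversity` are hypothetical, and the counts are invented for illustration rather than taken from any library.

```python
# Hypothetical library: g[x] = number of genes whose transcripts occur x times.
g = {1: 50, 2: 20, 5: 6, 20: 2}
# Hypothetical sample: f[y] = number of genes observed y times in the sample.
f = {1: 12, 2: 3, 4: 1}


def total_count(dist):
    """Total number of transcripts: sum over classes of x * dist(x)."""
    return sum(x * count for x, count in dist.items())


def diversity(dist):
    """Number of distinct genes: sum over classes of dist(x)."""
    return sum(dist.values())


N, n = total_count(g), total_count(f)      # transcripts in library, in sample
S_tot, S_obs = diversity(g), diversity(f)  # total and observed diversity
r = n / N                                  # sampling proportion
q = S_obs / S_tot                          # proportion of diversity sampled

print(f"N={N} n={n} S_tot={S_tot} S_obs={S_obs} r={r:.4f} q={q:.4f}")
```

Here singletons dominate both distributions, which is typical of the case where $q$ is much larger than $r$: a small fraction of the transcripts already reveals a larger fraction of the genes.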

Now, consider a library having $N$ transcripts that represent $S$ genes with abundances $n_1, n_2, n_3, ..., n_S$. The probability of sampling $y$ transcripts from gene $i$ in a sample of size $n$ is given by the hypergeometric distribution:


\begin{displaymath}
P(y) = \frac{{n_i \choose y}{N - n_i \choose n-y}}{{N \choose n}}, 0 \le y \le n_i.
\end{displaymath}

Note that ${N \choose n}$ is the binomial coefficient, or the combinatorial ``choose'' expression that describes how many ways one can sample $n$ objects from a collection of size $N$ [44,118].
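The hypergeometric probability above can be evaluated directly with exact integer binomial coefficients. The sketch below is a minimal illustration; the function name `hypergeom_pmf` and the numerical values of $N$, $n$ and $n_i$ are assumptions chosen for the example.

```python
from math import comb


def hypergeom_pmf(y, n_i, N, n):
    """Probability that a sample of n transcripts, drawn without replacement
    from a library of N, contains exactly y copies of a gene with n_i copies."""
    # Outside the support the probability is zero.
    if y < 0 or y > n_i or n - y < 0 or n - y > N - n_i:
        return 0.0
    return comb(n_i, y) * comb(N - n_i, n - y) / comb(N, n)


# Illustrative values (assumptions, not from the text).
N, n, n_i = 1000, 100, 20
pmf = [hypergeom_pmf(y, n_i, N, n) for y in range(n_i + 1)]

total = sum(pmf)                              # probabilities sum to 1
mean = sum(y * p for y, p in enumerate(pmf))  # expected copies: n * n_i / N
print(total, mean)
```

The mean of the distribution is $n n_i / N$, so a gene is expected to appear in the sample in proportion to its abundance in the library.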

Repeated samples of size $n$ from the distribution $g(x)$ will yield the following:


\begin{displaymath}
f(y,x) = g(x) \frac{{x \choose y}{N-x\choose n-y}}{{N \choose n}}, 0 \le y \le x.
\end{displaymath}

This expression gives the transformation from $g$ to $f$ in the sampling process [38].

How are $f$ and $g$ related? If we take the limit of increasing $n$ while maintaining a constant ratio $\lambda = xn/N = rx$, the previous expression converges to the Poisson distribution $e^{-\lambda}(\lambda)^y/y!$ [44,38]. Substituting notation yields $P(y)=e^{-rx}(rx)^y/y!$.
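A numerical check of this limit is sketched below. It assumes a regime in which both the abundance $x$ and the sample size $n$ grow while remaining small relative to $N$, with $\lambda = xn/N$ held fixed; the specific values and the function names are illustrative assumptions.

```python
from math import comb, exp, factorial


def hypergeom_pmf(y, x, N, n):
    """Hypergeometric probability of y copies in a sample of n from N,
    for a gene present in x copies."""
    if y < 0 or y > x or n - y < 0 or n - y > N - x:
        return 0.0
    return comb(x, y) * comb(N - x, n - y) / comb(N, n)


def poisson_pmf(y, lam):
    """Poisson probability e^{-lam} lam^y / y!."""
    return exp(-lam) * lam**y / factorial(y)


lam = 2  # fixed ratio x * n / N
errors = []
for k in (10, 100, 1000):
    x = n = k            # abundance and sample size grow together
    N = k * k // lam     # library size chosen so that x * n / N == lam
    err = max(abs(hypergeom_pmf(y, x, N, n) - poisson_pmf(y, lam))
              for y in range(11))
    errors.append(err)

print(errors)  # largest pointwise difference for each library size
```

The largest difference between the two distributions shrinks as the library grows, which is the sense in which the Poisson form can stand in for the hypergeometric one.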

Thus, we can represent the distribution of individuals in a sample as:


\begin{displaymath}f(y) = \int [e^{-rx}(rx)^y/y!] g(x) dx \end{displaymath}

Note that this expression is general with regard to the underlying distribution of transcript abundances $g(x)$. That is, $f$ is a rescaling of $g$ by a factor of $1/r$, or the inverse proportion of the population that has been sampled. As $n \rightarrow N$, $r
\rightarrow 1$, and $f \rightarrow g$. Thus, increased sampling converges on the underlying distribution from which samples are drawn. The two differ in that the largest abundance class $x_{max}=X$ in the underlying distribution $g(x)$ corresponds to a largest abundance class $y_{max}=rX$ in $f(y)$ [38].
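For a discrete abundance distribution the integral for $f(y)$ becomes a sum over abundance classes. The following sketch evaluates that sum; the dictionary `g`, the value of $r$, the truncation point `y_max` and the function names are assumptions made for illustration.

```python
from math import exp, factorial


def poisson_pmf(y, lam):
    """Poisson probability e^{-lam} lam^y / y!."""
    return exp(-lam) * lam**y / factorial(y)


def expected_sample_distribution(g, r, y_max):
    """Return [f(0), ..., f(y_max)] with f(y) = sum_x Poisson(y; r x) g(x),
    the discrete analogue of the integral over x."""
    return [sum(count * poisson_pmf(y, r * x) for x, count in g.items())
            for y in range(y_max + 1)]


# Hypothetical library: g[x] = number of genes of abundance x.
g = {1: 50, 2: 20, 5: 6, 20: 2}
N = sum(x * count for x, count in g.items())
r = 0.25  # assumed sampling proportion n / N

f = expected_sample_distribution(g, r, y_max=80)
genes_total = sum(f)                                 # includes f(0), the missed genes
transcripts = sum(y * fy for y, fy in enumerate(f))  # expected sample size, r * N
S_obs_expected = sum(f[1:])                          # genes seen at least once
print(genes_total, transcripts, S_obs_expected)
```

Note that $f(0)$ counts the genes that the sample misses entirely, so the expected observed diversity is the sum of $f(y)$ over $y \ge 1$.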

The next step is to relate the functional form of an accumulation curve to the distribution of abundances in general. The idea is to let $g$ depend on the sampling step, writing $g(x,n)$ for the number of genes of abundance $x$ that have not yet been sampled by step $n$. Note that $N = \int x g(x) dx$.

At each step $n$, we update $g(x, n+1) = g(x,n) - \Delta g(x,n)$, where $\Delta g(x,n) = 1$ if and only if a previously unsampled gene of abundance $x$ is sampled at step $n$, and $\Delta g(x,n) = 0$ otherwise. The probability that $\Delta g(x,n) = 1$ is $xg(x,n)/N$, so the expectation is $\langle \Delta g(x,n)\rangle = \frac{x}{N} g(x,n)$. Iterating $g(x,n)$ by its expectation, we obtain:


\begin{displaymath}
\langle g(x, n+1)\rangle = \langle g(x,n)\rangle - \langle \Delta g(x,n)\rangle = \langle g(x,n)\rangle - \frac{x}{N}\langle g(x,n)\rangle
\end{displaymath}

or, treating $n$ as a continuous variable,


\begin{displaymath}
\frac{\partial g(x,n)}{\partial n} = -\frac{x}{N}g(x,n)
\end{displaymath}

the solution of which, with initial condition $g(x,0) = g(x)$, is $g(x,n) = g(x) e^{-nx/N} = g(x) e^{-rx}$.
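The step from the discrete update to the exponential solution can be checked numerically. The sketch below iterates the expected update for a single abundance class and compares the result with the closed form; the function names and parameter values are assumptions for illustration.

```python
from math import exp


def unsampled_by_iteration(g0, x, N, n):
    """Apply the expected update g(x, k+1) = g(x, k) - (x / N) g(x, k)
    for k = 0, ..., n - 1, starting from g(x, 0) = g0."""
    g = float(g0)
    for _ in range(n):
        g -= (x / N) * g
    return g


def unsampled_closed_form(g0, x, N, n):
    """Continuous-n solution g(x, n) = g(x) exp(-n x / N)."""
    return g0 * exp(-n * x / N)


# Illustrative values (assumptions): 100 genes of abundance 5 in a library
# of 10,000 transcripts, after 2,000 sampling steps.
g0, x, N, n = 100, 5, 10_000, 2_000
iterated = unsampled_by_iteration(g0, x, N, n)
closed = unsampled_closed_form(g0, x, N, n)
rel_err = abs(iterated - closed) / closed
print(iterated, closed, rel_err)
```

The iterated value equals $g(x)(1 - x/N)^n$, which lies slightly below the exponential and approaches it when $x/N$ is small.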

Now let the accumulation curve $u(n)$ be the number of distinct genes (transcript types) seen by step $n$. The probability that $u(n+1) = u(n) + 1$ is the sum over abundance classes of the probabilities of sampling a not-yet-seen gene of abundance $x$ in $g(x,n)$, or


\begin{displaymath}
\int \frac{x}{N} g(x,n) dx
\end{displaymath}

Then the expected increment $\langle u(n+1) - u(n)\rangle$ is

\begin{displaymath}
\int \frac{x}{N} g(x,n) dx = \langle \frac{\partial u}{\partial n}\rangle
\end{displaymath}

Substituting the above solution,


\begin{displaymath}
\frac{\partial u}{\partial n} = \int \frac{x}{N} g(x) e^{-nx/N} dx
\end{displaymath}

Integrating with respect to $n$ from $0$, with $u(0) = 0$, shows that for any abundance distribution $g(x)$ the accumulation curve $u$ has the general form:


\begin{displaymath}
u(n) = \int g(x) dx - \int e^{-rx} g(x) dx
\end{displaymath}
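The general form can be evaluated for any discrete abundance distribution and compared with simulated sampling. The sketch below assumes sampling with replacement, which matches the fixed denominator $N$ in the update rule; the dictionary `g`, the function names, the seed and the number of replicates are all illustrative assumptions.

```python
import math
import random


def accumulation_curve(g, n):
    """u(n) = sum_x g(x) - sum_x exp(-n x / N) g(x) for discrete g."""
    N = sum(x * count for x, count in g.items())
    return sum(count * (1.0 - math.exp(-n * x / N)) for x, count in g.items())


def simulate_distinct(g, n, rng):
    """Draw n transcripts with replacement (an assumption) and count the
    distinct genes observed."""
    pool = []
    gene = 0
    for x, count in g.items():
        for _ in range(count):
            pool.extend([gene] * x)  # gene appears x times in the library
            gene += 1
    return len({rng.choice(pool) for _ in range(n)})


# Hypothetical library: g[x] = number of genes of abundance x.
g = {1: 50, 2: 20, 5: 6, 20: 2}
S_tot = sum(g.values())

rng = random.Random(0)
runs = [simulate_distinct(g, 80, rng) for _ in range(200)]
mean_sim = sum(runs) / len(runs)
print(accumulation_curve(g, 80), mean_sim, S_tot)
```

As $n$ grows the exponential term vanishes and $u(n)$ approaches $\int g(x) dx = S_{tot}$. When all $S$ genes share a single abundance, $nx/N = n/S$ and the curve reduces to $S(1 - e^{-n/S})$.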

Using the Taylor series expansion for small $rx$, we have


\begin{displaymath}
e^{-rx} = 1 - rx + \frac{1}{2!}(rx)^2 - ...
\end{displaymath}

so


\begin{displaymath}
u(n) = \int rx g(x) dx - \frac{1}{2!} \int (rx)^2 g(x) dx + \frac{1}{3!} \int (rx)^3 g(x) dx - ...
\end{displaymath}

or


\begin{displaymath}
u(n) = \int \frac{nx}{N} g(x) dx - \frac{1}{2!} \int \left(\frac{nx}{N}\right)^2 g(x) dx + \frac{1}{3!} \int \left(\frac{nx}{N}\right)^3 g(x) dx - ...
\end{displaymath}

Substituting $N = \int x g(x) dx$, we obtain


\begin{displaymath}
u(n) = n - \frac{n^2}{2!} \frac{\int x^2 g(x) dx}{(\int x g(x) dx)^2} + \frac{n^3}{3!} \frac{\int x^3 g(x) dx}{(\int x g(x) dx)^3} - ...
\end{displaymath}

or simply


\begin{displaymath}
u(n) = n - \frac{n^2}{2!} \int \left(\frac{x}{N}\right)^2 g(x) dx + \frac{n^3}{3!} \int \left(\frac{x}{N}\right)^3 g(x) dx - ...
\end{displaymath}

Thus, the form of the accumulation curve $u(n)$ depends on $g(x)$ through its moments $\int x^k g(x) dx$, each scaled by $N^k$.
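The series can be compared with the closed form $u(n) = \int g(x) dx - \int e^{-nx/N} g(x) dx$ for a discrete example. In the sketch below the dictionary `g`, the function names and the choice of $n$ are assumptions for illustration; the moments are computed as sums over abundance classes.

```python
from math import exp, factorial


def accumulation_exact(g, n):
    """Closed form: sum_x g(x) (1 - exp(-n x / N))."""
    N = sum(x * count for x, count in g.items())
    return sum(count * (1.0 - exp(-n * x / N)) for x, count in g.items())


def accumulation_series(g, n, terms):
    """Truncated series: sum_{k=1}^{terms} (-1)^(k+1) n^k / k! * M_k / N^k,
    where M_k = sum_x x^k g(x) is the k-th moment of g."""
    N = sum(x * count for x, count in g.items())
    total = 0.0
    for k in range(1, terms + 1):
        moment = sum(count * x**k for x, count in g.items())
        total += (-1) ** (k + 1) * n**k / factorial(k) * moment / N**k
    return total


# Hypothetical library: g[x] = number of genes of abundance x.
g = {1: 50, 2: 20, 5: 6, 20: 2}
n = 10
exact = accumulation_exact(g, n)
for terms in (1, 3, 6, 12):
    approx = accumulation_series(g, n, terms)
    print(terms, approx, abs(approx - exact))
```

The first term alone gives $u(n) = n$, that is, every early sample is new. Higher moments, which are dominated by the most abundant classes, set how quickly the curve bends away from that line, and more terms are needed when $nx/N$ is not small for the largest $x$.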


Peter T. Hraber 2001-06-13