We will use the following notation for abundance distributions: $x$ is
an abundance class. Singletons (only one individual of a type
sampled) correspond to $x = 1$, duplicates to $x = 2$, etc. Note that
$x$ may be either discrete or continuous, but $x > 0$. In either case,
$x$ indicates a size class and some function on $x$, giving the number
of types in that class, defines the abundance distribution. Several
abundance distributions merit interest: $G(x)$ is the abundance
distribution of genes in a genome; $T(x)$ is the distribution of
transcripts expressed in a particular tissue type or developmental
stage, or in a cDNA library; $t(x)$ is a sample of transcripts drawn
from $T(x)$. The total number of transcripts is $N = \sum_x x\,T(x)$,
or $n = \sum_x x\,t(x)$ in a sample; total diversity is
$S = \sum_x T(x)$, and observed diversity is $s = \sum_x t(x)$. For
convenience, we also define the sampling proportion $p = n/N$ and the
proportion of sampled diversity $q = s/S$.
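
As a small illustration of this notation, the sketch below computes the totals and proportions for a toy library and a toy sample. It assumes each abundance distribution is stored as a mapping from abundance class to the number of types in that class. The names `T`, `t`, `N`, `n`, `S`, `s`, `p`, `q` and all numbers are illustrative assumptions, not values from the text.

```python
# Hypothetical toy distributions: abundance class -> number of types.
T = {1: 20, 2: 10, 5: 2}   # library: 20 genes with 1 copy, 10 with 2, 2 with 5
t = {1: 8, 2: 1}           # sample: 8 singletons and 1 duplicate


def total_individuals(dist):
    """Total number of transcripts: sum of x * (number of types with abundance x)."""
    return sum(x * count for x, count in dist.items())


def total_diversity(dist):
    """Total number of distinct types: sum of the counts over all classes."""
    return sum(dist.values())


N, S = total_individuals(T), total_diversity(T)   # library size and diversity
n, s = total_individuals(t), total_diversity(t)   # sample size and observed diversity
p = n / N   # sampling proportion
q = s / S   # proportion of sampled diversity
```

Here a sample containing a fifth of the transcripts recovers a somewhat larger fraction of the genes.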
Now, consider a library having $N$ transcripts that represent $S$
genes with abundances $X_1, \ldots, X_S$. The probability of
sampling $x$ transcripts from gene $i$ in a sample of size $n$ is
given by the hypergeometric distribution:
\[
P(x \mid X_i) = \frac{\binom{X_i}{x}\binom{N - X_i}{n - x}}{\binom{N}{n}}.
\]
Note that $\binom{a}{b}$ is the binomial coefficient, or the
combinatorial ``choose'' expression that describes how many ways one
can sample $b$ objects from a collection of size $a$ [44,118].
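Repeated samples of size $n$ from the distribution $T$ will yield, on
average, the following:
\[
t(x) = \sum_{X \ge x} T(X)\,
       \frac{\binom{X}{x}\binom{N - X}{n - x}}{\binom{N}{n}}.
\]
This expression gives the transformation from $T$ to $t$ in the
sampling process [38].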
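
The transformation from the library distribution to the expected sample distribution can be sketched numerically. The example below is a minimal sketch, assuming the library is given as a mapping from abundance class to number of genes and that sampling is without replacement (hypergeometric). The function name `expected_sample_distribution` and the toy library are hypothetical.

```python
from math import comb


def expected_sample_distribution(T, n):
    """Expected number of genes seen exactly x times in a sample of n transcripts.

    T maps a library abundance X to the number of genes with that abundance.
    Each gene contributes its hypergeometric probability of being drawn x times.
    """
    N = sum(X * count for X, count in T.items())  # library size
    ways_total = comb(N, n)
    t = {}
    for X, count in T.items():
        for x in range(1, min(X, n) + 1):
            # math.comb returns 0 when the lower index exceeds the upper one,
            # so impossible outcomes contribute zero probability.
            prob = comb(X, x) * comb(N - X, n - x) / ways_total
            t[x] = t.get(x, 0.0) + count * prob
    return t


T = {1: 20, 2: 10, 5: 2}                       # hypothetical library, N = 50
t_small = expected_sample_distribution(T, 10)  # sample one fifth of the library
t_full = expected_sample_distribution(T, 50)   # exhaustive sample
```

Two sanity checks follow from the formula: the expected counts weighted by abundance class sum to the sample size, and an exhaustive sample reproduces the library distribution.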
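How are $T$ and $t$ related? If we take the limit of increasing $N$
while maintaining a constant ratio $n/N$, the previous
expression converges to the Poisson distribution
$P(x \mid X) = \lambda^x e^{-\lambda}/x!$ with mean $\lambda = nX/N$
[44,38]; the approximation is closest when $n/N$ is small. Substituting
notation yields $\lambda = pX$.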
Thus, we can represent the distribution of individuals in a sample as:
\[
t(x) = \sum_{X} T(X)\, \frac{(pX)^x e^{-pX}}{x!}.
\]
Note that this expression is general with regard to the underlying
distribution of transcript abundances $T$. That is, $t$ is a
rescaling of $T$: abundances in $t$ are those of $T$ scaled down by a
factor of $1/p = N/n$, the inverse of the proportion of
the population that has been sampled. As $p \to 1$, $s \to S$, and
$t \to T$. Thus, increased sampling
converges on the underlying distribution from which samples are drawn.
The two differ in that the largest abundance class $X_{\max}$ in the
underlying distribution $T$ corresponds to a largest abundance
class of about $pX_{\max}$ in $t$ [38].
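
To get a feel for how close the Poisson form is to the exact hypergeometric probabilities, the sketch below compares the two for one gene. It is a minimal illustration under assumed values: the library size `N`, sample size `n`, and gene abundance `X` are hypothetical, chosen so that the sampling proportion is small (0.01) and the Poisson mean is 2.

```python
from math import comb, exp, factorial


def hypergeom_pmf(x, X, N, n):
    """Exact probability of drawing x of the X copies when sampling n of N."""
    return comb(X, x) * comb(N - X, n - x) / comb(N, n)


def poisson_pmf(x, lam):
    """Poisson probability of observing x events with mean lam."""
    return exp(-lam) * lam ** x / factorial(x)


N, n, X = 100_000, 1_000, 200   # hypothetical library, sample, gene abundance
lam = n * X / N                 # Poisson mean, i.e. sampling proportion times X

# Largest pointwise gap between the two distributions over x = 0..10.
max_diff = max(
    abs(hypergeom_pmf(x, X, N, n) - poisson_pmf(x, lam)) for x in range(11)
)
```

With a small sampling proportion the two distributions differ only slightly at every count, and the gene's expected count in the sample is its library abundance scaled by the sampling proportion.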
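The next step is to relate the functional form of an accumulation
curve to the distribution of abundances in general. The idea is to
update $u_k(x)$ as the number of transcripts of abundance $x$ not
sampled by step $k$. Note that $u_0(x) = T(x)$.
At each step $k$, we update $u_{k+1}(x) = u_k(x) - \delta_k(x)$, where
$\delta_k(x) = 1$ iff sampling a previously unsampled gene of
abundance $x$ at step $k$, and $\delta_k(x) = 0$ otherwise.
The probability of $\delta_k(x)$ being $1$ is $x\,u_k(x)/N$, so the
expectation is $E[\delta_k(x)] = x\,u_k(x)/N$.
Iterating $u_k(x)$ by its expectation, we obtain:
\[
u_{k+1}(x) = u_k(x) - \frac{x}{N}\,u_k(x),
\]
or, treating $k$ as continuous,
\[
\frac{d u_k(x)}{dk} = -\frac{x}{N}\,u_k(x),
\]
the solution of which is $u_k(x) = T(x)\,e^{-xk/N}$.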
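
The step from the expected update rule to its exponential solution can be checked numerically. The sketch below iterates the expected decrement for a single abundance class and compares the result with the exponential decay it approximates. All names and numbers (`T_x`, `x`, `N`, `k`) are illustrative assumptions.

```python
from math import exp


def unsampled_by_iteration(T_x, x, N, steps):
    """Iterate the expected update: each step removes a fraction x/N of the
    genes of abundance x that are still unsampled."""
    u = float(T_x)
    for _ in range(steps):
        u -= x * u / N
    return u


def unsampled_closed_form(T_x, x, N, k):
    """Continuous-step solution of the same update: exponential decay in k."""
    return T_x * exp(-x * k / N)


# Hypothetical values: 500 genes of abundance 3 in a library of 10,000 transcripts.
T_x, x, N, k = 500, 3, 10_000, 2_000
iterated = unsampled_by_iteration(T_x, x, N, k)
closed = unsampled_closed_form(T_x, x, N, k)
rel_err = abs(iterated - closed) / closed
```

The two values agree closely whenever the per-step removal fraction is small, which is the regime in which replacing the difference equation by a differential equation is reasonable.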
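Now let the accumulation curve $A(k)$ be the number of new transcripts
seen by step $k$. The probability that $A(k+1) - A(k) = 1$ is the sum
over $x$ of the probabilities of seeing a gene of abundance $x$ in
$u_k$, or
\[
\Pr\left[A(k+1) - A(k) = 1\right] = \sum_x \frac{x}{N}\,u_k(x).
\]
Then $dA/dk$ is
\[
\frac{dA}{dk} = \sum_x \frac{x}{N}\,u_k(x).
\]
Substituting the above solution,
\[
\frac{dA}{dk} = \sum_x \frac{x}{N}\,T(x)\,e^{-xk/N}.
\]
Thus, for any abundance distribution $T$, the accumulation curve
$A(k)$ has the general form:
\[
A(k) = \sum_x T(x)\left(1 - e^{-xk/N}\right).
\]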
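
The general form can be compared against a direct simulation. The sketch below evaluates the curve as the sum, over abundance classes, of the number of genes in the class times one minus an exponential decay in the number of draws, and checks it against the average number of distinct genes found in simulated samples. It assumes draws are made with replacement, which matches a constant per-draw probability. The toy library, the function names, and the seed are hypothetical.

```python
import random
from math import exp


def accumulation_curve(T, k):
    """Expected number of distinct genes seen after k draws.

    T maps abundance class x to the number of genes in that class.
    """
    N = sum(x * count for x, count in T.items())
    return sum(count * (1.0 - exp(-x * k / N)) for x, count in T.items())


def simulate_distinct(T, k, trials, seed=0):
    """Average number of distinct genes in k draws with replacement."""
    rng = random.Random(seed)
    library = []          # one entry per transcript, labelled by its gene
    gene_id = 0
    for x, count in T.items():
        for _ in range(count):
            library.extend([gene_id] * x)
            gene_id += 1
    total = 0
    for _ in range(trials):
        total += len(set(rng.choices(library, k=k)))
    return total / trials


T = {1: 200, 2: 100, 10: 20, 50: 4}   # hypothetical library: 800 transcripts, 324 genes
predicted = accumulation_curve(T, 400)
simulated = simulate_distinct(T, 400, trials=200)
```

The simulated average should fall close to the predicted value, with the remaining gap reflecting sampling noise and the continuous-step approximation.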
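Using the Taylor series expansion for small $xk/N$, we have
\[
e^{-xk/N} \approx 1 - \frac{xk}{N} + \frac{x^2 k^2}{2N^2},
\]
so
\[
A(k) \approx \sum_x T(x)\left(\frac{xk}{N} - \frac{x^2 k^2}{2N^2}\right),
\]
or
\[
A(k) \approx \frac{k}{N}\sum_x x\,T(x) - \frac{k^2}{2N^2}\sum_x x^2\,T(x).
\]
Substituting $\sum_x x\,T(x) = N$, we obtain
\[
A(k) \approx k - \frac{k^2}{2N^2}\sum_x x^2\,T(x),
\]
or simply
\[
A(k) \approx k\left(1 - \frac{k}{2N^2}\sum_x x^2\,T(x)\right).
\]
Thus, it can be seen that the form of the accumulation curve
depends on $T(x)$, to this order through its second moment
$\sum_x x^2\,T(x)$.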
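
The dependence of the curve on the abundance distribution can be illustrated by comparing two libraries of equal size and equal diversity. The sketch below is a hypothetical example: it evaluates the exponential form of the accumulation curve for an even and a skewed library. It also reports the sum of squared abundances as one possible summary of unevenness. That choice of summary is an assumption of this sketch, motivated by a second-order expansion of the exponential.

```python
from math import exp


def accumulation_curve(T, k):
    """Expected number of distinct genes seen after k draws from library T."""
    N = sum(x * count for x, count in T.items())
    return sum(count * (1.0 - exp(-x * k / N)) for x, count in T.items())


def second_moment(T):
    """Sum of squared abundances over all genes (assumed unevenness summary)."""
    return sum(x * x * count for x, count in T.items())


# Two hypothetical libraries, each with 1,000 transcripts and 100 genes.
even = {10: 100}            # every gene has 10 copies
skewed = {1: 90, 91: 10}    # 90 rare genes and 10 highly expressed genes

a_even = accumulation_curve(even, 100)
a_skewed = accumulation_curve(skewed, 100)
m_even, m_skewed = second_moment(even), second_moment(skewed)
```

After the same number of draws the even library yields far more distinct genes than the skewed one, and the skewed library has the larger sum of squared abundances.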