

Validation

Which estimators accurately predict diversity?

Figure 3.4: Diversity estimates as a function of sample size. Accumulation curves illustrate observed and estimated transcript diversity as a function of increasing samples, for four validation frequency distributions: (A) Poisson, (B) exponential, (C) log-normal, and (D) negative binomial. Estimators are described in Section 3.3.1. Samples were of n=500 individuals, drawn with replacement between samples. Diversity was averaged from m=20 independently resampled replicates, in which the sampling order was permuted. Standard deviation is indicated by vertical impulses. All axes are shown on the same scale, for comparison.

Figure 3.5: Estimator accuracy validation. Boxplots show estimator accuracy for four validation distributions and four estimators: (A) ACE, (B) ICE, (C) Chao 1, and (D) Michaelis-Menten. References for each estimator are provided in Section 3.3.1. Estimator accuracy ($\delta$) is measured as $(\hat{S} - S)/S$, where $\hat{S}$ is estimated diversity and $S$ is true diversity. Validation distributions are lettered a-d as described in Section 3.3.1.


Table 3.1: Estimator accuracy test results. For each of the four validation distributions (a-d), estimator accuracy was measured as the deviation from true diversity, scaled by true diversity: $\delta = (\hat{S} - S)/S$. Mean and standard deviation ($s$) of $\delta$ were calculated from 20 replicate sampling simulations, in each of which 500 individuals were sampled before calculating estimators. To identify reliable estimators, two two-tailed univariate tests of the null hypothesis that $\delta = 0$ were performed: Student's $t$ and the non-parametric Wilcoxon signed-rank test, $W$. Rejection of the null hypothesis indicates that the estimator is inaccurate. $P$-values were interpreted to indicate significant differences at 95% (*) and 99% (**) experiment-wide confidence levels ($\alpha'=\alpha/20$), with $\nu = 19$ degrees of freedom for the $t$-test.
DISTRIBUTION    $\delta$    $s$        $t$        $P$ ($t$)    $W$    $P$ ($W$)

ACE ESTIMATOR
a                0.00326    0.00402      3.630    0.0018*      192    0.0005**
b                0.00035    0.00161      0.980    0.3395       127    0.4221
c                0.00379    0.00367      4.620    0.0002*      193    0.0004**
d               -0.00058    0.00126     -2.068    0.0526        56    0.0702

ICE ESTIMATOR
a                0.00624    0.00406      6.866    0.0001**     209    0.0001**
b                0.00113    0.00163      3.126    0.0056       173    0.0094
c                0.00554    0.00375      6.607    0.0001**     203    0.0001**
d               -0.00034    0.00126     -1.201    0.2444        76    0.2789

CHAO1 ESTIMATOR
a               -0.00211    0.00394     -2.391    0.0273        50    0.0399
b               -0.00003    0.00187     -0.082    0.9360       103    0.9563
c               -0.00123    0.00476     -1.158    0.2614        84    0.4524
d               -0.00039    0.00132     -1.329    0.1995        72    0.2250

CHAO2 ESTIMATOR
a               -0.00219    0.00388     -2.529    0.0204        44    0.0215
b               -0.00004    0.00185     -0.094    0.9258       102    0.9273
c               -0.00124    0.00479     -1.154    0.2627        85    0.4749
d               -0.00039    0.00132     -1.329    0.1997        72    0.2250

MM ESTIMATOR
a                0.23480    0.00702    149.674    0.0001**     210    0.0001**
b                0.17495    0.00575    136.163    0.0001**     210    0.0001**
c                0.20930    0.00430    217.704    0.0001**     210    0.0001**
d                0.12596    0.00339    165.916    0.0001**     210    0.0001**
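
The $t$ column of Table 3.1 can be approximately recovered from the tabulated mean and standard deviation of $\delta$. The following is a minimal sketch, not the analysis code used for the table: the function name `t_from_summary` is hypothetical, and the inputs are the rounded values printed in the table, so the results agree with the tabulated $t$ only to about two decimal places.

```python
import math

def t_from_summary(mean_delta, sd_delta, n):
    """One-sample t statistic for H0: delta = 0, from summary statistics."""
    standard_error = sd_delta / math.sqrt(n)
    return mean_delta / standard_error

n = 20                 # replicate simulations per distribution (nu = n - 1 = 19)
alpha_95 = 0.05 / 20   # corrected threshold alpha' = alpha/20 for the 95% level
alpha_99 = 0.01 / 20   # corrected threshold alpha' = alpha/20 for the 99% level

# Rounded (mean delta, s) for the ACE estimator, rows a and b of Table 3.1.
t_a = t_from_summary(0.00326, 0.00402, n)
t_b = t_from_summary(0.00035, 0.00161, n)

print(round(t_a, 2), round(t_b, 2), alpha_95, alpha_99)
```

Both values fall close to the tabulated 3.630 and 0.980; the small differences are presumably due to rounding of the published $\delta$ and $s$. A $P$-value is marked (*) only if it falls below `alpha_95`, and (**) only if it falls below `alpha_99`.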

Accumulation curves (Figure 3.4) show observed and estimated transcript diversity as the number of samples increases. They also reveal biases at small sample sizes, which diminish rapidly for most estimators as samples accumulate. Estimator variance is low. The ACE and ICE estimators were correlated, as were Chao 1 and Chao 2, so we generally refer only to ACE and Chao 1 (Figure 3.4).
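
The construction of an accumulation curve for observed diversity can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code used to produce Figure 3.4: each sample is assumed to be a list of transcript labels, the function name `accumulation_curve` and the toy data are hypothetical, and only observed diversity is tracked (the estimators of Section 3.3.1 are omitted).

```python
import random

def accumulation_curve(samples, replicates=20, seed=0):
    """Mean number of distinct transcripts after 1..k samples,
    averaged over replicates in which the sample order is permuted."""
    rng = random.Random(seed)
    k = len(samples)
    totals = [0.0] * k
    for _ in range(replicates):
        order = samples[:]          # copy, so the caller's list is untouched
        rng.shuffle(order)          # permute the sampling order
        seen = set()
        for i, sample in enumerate(order):
            seen.update(sample)     # accumulate distinct transcripts
            totals[i] += len(seen)
    return [total / replicates for total in totals]

# Toy data (hypothetical): three samples covering four distinct transcripts.
samples = [["t1", "t2"], ["t2", "t3"], ["t1", "t4"]]
curve = accumulation_curve(samples)
print(curve)
```

The curve is non-decreasing by construction, and its final value equals the total number of distinct transcripts across all samples, whatever the order.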

The least biased estimators were Chao 1 (Figure 3.5C and Table 3.1) and Chao 2. The coverage estimators ACE and ICE appear unbiased on visual inspection, and converge rapidly on the diversity limit (Figure 3.4). However, formal tests indicate that these estimators were accurate for only two of the four test distributions (Table 3.1): b (exponential) and d (negative binomial). Accuracy under the negative binomial is consistent with Chao and Lee's investigations [29], which used that distribution. Coverage estimators overestimate diversity by about 0.4% to 0.8% (Figure 3.5A and B). This is consistent with Colwell and Coddington's evaluation of empirical data from species counts in seed-bank samples [32].
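
As a concrete illustration of the accuracy measure $\delta = (\hat{S} - S)/S$, the sketch below applies it to a Chao 1 estimate. It assumes the classic form of Chao 1, $S_{obs} + f_1^2/(2 f_2)$, where $f_1$ and $f_2$ are the numbers of transcripts observed exactly once and exactly twice, with the commonly used bias-corrected fallback when $f_2 = 0$. The definition actually used here is the one given in Section 3.3.1 and may differ in detail; the function names and the toy abundances are hypothetical.

```python
def chao1(abundances):
    """Classic Chao 1 richness estimate from per-transcript counts (assumed form)."""
    s_obs = sum(1 for a in abundances if a > 0)   # observed diversity
    f1 = sum(1 for a in abundances if a == 1)     # singletons
    f2 = sum(1 for a in abundances if a == 2)     # doubletons
    if f2 > 0:
        return s_obs + f1 * f1 / (2.0 * f2)
    # Bias-corrected form, used here only to avoid division by zero.
    return s_obs + f1 * (f1 - 1) / 2.0

def relative_error(s_hat, s_true):
    """Accuracy measure delta = (S_hat - S) / S."""
    return (s_hat - s_true) / s_true

# Toy counts (hypothetical): 8 observed transcripts, 4 singletons, 2 doubletons.
abundances = [5, 3, 2, 2, 1, 1, 1, 1]
s_hat = chao1(abundances)
delta = relative_error(s_hat, 10)   # assuming a true diversity of 10
print(s_hat, delta)
```

For these toy counts the estimate is $8 + 4^2/(2 \cdot 2) = 12$, so with an assumed true diversity of 10 the estimator overestimates by $\delta = 0.2$. Positive $\delta$ indicates overestimation and negative $\delta$ underestimation, as in Table 3.1.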

The Jackknife 1, Jackknife 2, and Bootstrap estimators were biased (not shown). The Michaelis-Menten maximum likelihood estimator was inaccurate (Figure 3.5D and Table 3.1), overestimating diversity by 12% to 24%. The Michaelis-Menten model is a parametric estimator [32,94,112], and depends on the transcript frequency distribution having properties that are not satisfied in all cases, as demonstrated in Appendix A.

The preferred estimators in this case are Chao 1 and Chao 2, with possible consideration of ACE and ICE. What do they tell us about transcript diversity in pure and mixed tissue cultures?


Peter T. Hraber 2001-06-13