The sequence libraries analyzed for diversity were described previously in Section 2.3.3. To compare diversity inferences made from different libraries obtained from similar tissues, two new libraries were added, both prepared and sequenced at the Samuel Roberts Noble Foundation (NF): an axenic developing root library (NF root) and a nodule-forming developing root library to which the nitrogen-fixing bacterium S. meliloti had been added (NF nod). These sequences were retrieved from GenBank using the Entrez query interface [13,126,127].
The new libraries were subjected to the same filtering and screening methods described previously to prepare sequences for analysis. That is, filtering trimmed poly-A and poly-T regions to no more than 13 nucleotides, and screening removed non-protein-coding ribosomal DNA sequences and any remaining sequences shorter than 300 nucleotides. From an initial collection of 2819 raw sequences (1,130,942 nt) in the developing root library and 2924 sequences (1,579,301 nt) in the nodulating library, 2142 (995,916 nt) and 2689 (1,514,777 nt) remained, respectively, after filtering and screening.
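The trimming and length-screening steps can be illustrated with a minimal Python sketch. It is not the pipeline used to prepare the libraries: the function names are hypothetical, the sketch assumes that every run of A or T longer than 13 nucleotides is shortened wherever it occurs in the sequence, and it omits the ribosomal screening step, which would require a similarity search against reference rRNA sequences.

```python
import re

MAX_RUN = 13      # longest poly-A / poly-T run retained (value from the text)
MIN_LENGTH = 300  # shortest sequence retained after trimming (value from the text)

def trim_poly_runs(seq, max_run=MAX_RUN):
    """Shorten every run of A or T longer than max_run to exactly max_run.

    Assumption: runs are trimmed wherever they occur, not only at the ends.
    """
    pattern = re.compile(r"A{%d,}|T{%d,}" % (max_run + 1, max_run + 1))
    return pattern.sub(lambda m: m.group(0)[:max_run], seq.upper())

def filter_and_screen(seqs, min_length=MIN_LENGTH):
    """Trim homopolymer runs, then drop sequences shorter than min_length.

    The ribosomal-sequence screen described in the text is not implemented here.
    """
    trimmed = (trim_poly_runs(s) for s in seqs)
    return [s for s in trimmed if len(s) >= min_length]
```

Under these assumptions, a sequence consisting mostly of a long poly-A tail is first shortened by trimming and may then fall below the 300-nucleotide cutoff, so the order of the two steps matters.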
To facilitate diversity comparisons between libraries, we analyzed only those libraries that contained 2000 or more transcripts after filtering and screening. A total of six libraries met this criterion; they are summarized in Table 3.2. These libraries all represent Medicago truncatula root tissues, in either pure culture or a mixed-culture treatment. The axenic root libraries were the KV0 and NF root libraries. Mixed-culture treatments include mycorrhizal roots (MHAM), Phytophthora medicaginis-infected roots (DSIR), and two nodulating root libraries (KV3 and NF nod).
No attempt was made to remove redundant transcripts from a library. Indeed, the redundancy of transcripts is useful information for estimating diversity. A normalized library would not yield reliable diversity estimates, because observed diversity increases monotonically with sample size until it reaches the limit of total library diversity.
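The dependence of observed diversity on sample size can be illustrated with the classical rarefaction formula, which gives the expected number of distinct transcripts in a random subsample drawn without replacement. This is a sketch for illustration only and is not necessarily the estimator used in this work; the function name and the toy abundance counts are hypothetical.

```python
from math import comb

def rarefied_richness(counts, n):
    """Expected number of distinct transcripts in a random subsample of size n.

    counts: copies observed for each distinct transcript in the library.
    Uses E[S_n] = sum_i (1 - C(N - N_i, n) / C(N, n)), where N = sum(counts).
    """
    total_copies = sum(counts)
    if not 0 <= n <= total_copies:
        raise ValueError("n must lie between 0 and the library size")
    all_draws = comb(total_copies, n)
    # comb(N - c, n) counts the subsamples that miss transcript i entirely.
    return sum(1 - comb(total_copies - c, n) / all_draws for c in counts)

# Hypothetical toy libraries: one redundant, one fully normalized.
redundant = [5, 3, 1, 1]
normalized = [1, 1, 1, 1]
curve = [rarefied_richness(redundant, n) for n in range(1, sum(redundant) + 1)]
```

For the redundant toy library the curve rises and flattens as n approaches the library size, whereas for the normalized toy library the expected richness simply equals n, so the curve carries no information about how close sampling is to saturation.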
For privately funded sequencing projects, it is not uncommon to withhold selected transcripts when making the bulk of the sequence data publicly available. Though we have not attempted to evaluate whether this practice would bias diversity inferences, we do not think it should dramatically affect the result, as the withheld transcripts could be considered unsampled. However, the practice of withholding sequences violates the assumption of random sampling. Inferences made from libraries where non-disclosure is common should therefore be treated with skepticism.
If more genes are expressed in mixed, symbiotic culture than in pure, axenic tissues, we expect diversity estimates to be greater in mixed-culture libraries than in pure-culture libraries. However, as shown in Section 2.4, some mixed-culture libraries conspicuously lack symbiont transcripts and are therefore less likely to contain more distinct transcripts than their axenic analogues.