Proof of Concept

The goal of our proof of concept was to determine whether our software can predict the relative abundances of individual species and genera from environmental conditions. To do so, we performed 16S sequencing on 12 different soil samples. When collecting these samples, we measured the environmental conditions associated with each one; these input values included soil moisture content, soil temperature, and soil nitrogen content. After sequencing, we tested our software by comparing the measured metagenomic information to the predictions of the AI/ANN model, using an algorithm that determines the probability that a blind model would make similarly accurate predictions. Through this analysis, we confirmed that our software is more effective than a randomized blind algorithm, meaning that its predictions reflect the environmental inputs rather than chance. Please see below to read more about our (1) experimental process and (2) results.




Experimental Process:

1. Selection of 16S PCR Primers



Picking the optimal primer set is a crucial step when setting up 16S rRNA sequencing. For our project, we relied heavily on our review of the literature when deciding which primer set to use. Among the 22 papers we read, 16 conducted 16S rRNA gene sequencing specifically to determine bacterial diversity in soil. The aims of these papers ranged from comparing 16S primer pairs in silico to applying 16S sequencing to soil samples. After carefully studying these papers, we decided that our primer set should amplify the V3-V4 hypervariable region. This region was mentioned by 17 of the 22 papers we read, which tested soil samples obtained from various regions using different sequencing platforms. In an evaluation of four 16S primer pairs, one paper concluded that “341f/785r [primers targeting the V3-V4 region] detected the highest bacterial diversity, broadest taxonomic coverage, and provided the most reproducible results” (Thijs et al. 2017). Other researchers also appear to have reached a consensus that primer pairs amplifying the V3-V4 region are “preferred” as they are accurate and have been used extensively across the field (Wang et al. 2018, Xia et al. 2019). Orwin et al. also stated that due to the popularity of these primer pairs, “the extensive reference databases for the V3-V4 region of the microbial 16S rRNA gene have allowed information about soil bacterial communities to be obtained at a much finer taxonomic resolution” (Orwin et al. 2018). Using a widely used primer pair (341f/785r) not only gives us many protocols from the scientific literature to reference for our experimental design, but also aids our data analysis through the existing reference databases for our region of interest.
Among the primer pairs we found targeting the V3-V4 region, the primer pair (f 5’- CCTACGGGNGGCWGCAG –3’) and (r 5’- GACTACHVGGGTATCTAATCC -3’) was used by 10 papers in addition to the Illumina 16S Metagenomic Sequencing Library Preparation Protocol (Illumina). Due to the accuracy of these primers and their consistent use in the literature, we selected this primer pair amplifying the V3-V4 region of the 16S rRNA gene for our own 16S rRNA sequencing experiments.
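Both primers contain degenerate bases (N, W, H, and V), so each one stands for a small pool of concrete oligonucleotides. The sketch below is only an illustration of that idea, not part of our wet-lab protocol: it uses the standard IUPAC nucleotide codes to count how many sequences each primer represents and to turn a primer into a regular expression for in silico matching. The function names `degeneracy` and `to_regex` are hypothetical helpers introduced here.

```python
import re

# Standard IUPAC nucleotide codes mapped to the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

FORWARD = "CCTACGGGNGGCWGCAG"      # forward primer sequence from the text
REVERSE = "GACTACHVGGGTATCTAATCC"  # reverse primer sequence from the text


def degeneracy(primer: str) -> int:
    """Count the distinct non-degenerate sequences a primer represents."""
    count = 1
    for base in primer:
        count *= len(IUPAC[base])
    return count


def to_regex(primer: str) -> re.Pattern:
    """Compile a primer into a regex that matches every sequence it covers."""
    parts = []
    for base in primer:
        options = IUPAC[base]
        parts.append(options if len(options) == 1 else f"[{options}]")
    return re.compile("".join(parts))


if __name__ == "__main__":
    print(degeneracy(FORWARD))  # 8  (N = 4 options, W = 2 options)
    print(degeneracy(REVERSE))  # 9  (H = 3 options, V = 3 options)
    print(to_regex(FORWARD).pattern)
```

The regular expression only finds exact matches on the strand it is given; searching a real template would also require the reverse complement and some tolerance for mismatches, which this sketch leaves out.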

2. Soil Sample Collection



  1. The soil samples were collected along the Matoaka Trails surrounding Lake Matoaka, Williamsburg, Virginia. These trails cover a range of soil types including Alluvium, Norfolk Formation, Windsor Formation, Bacon’s Castle Formation, Sedley Formation, Yorktown Formation, and Saint Mary’s Formation.
  2. At each soil digging site, we recorded the following parameters: GPS coordinates (latitude and longitude), sample collection time, sample depth, elevation, altitude, humidity, soil temperature, air temperature, and pH.
  3. Using a washed shovel, we collected the soil into a plastic collection bag. We shoveled the soil horizontally to ensure the entire soil sample came from a uniform depth.
  4. Moisture content was measured using an oven-drying method: each sample was weighed before and after oven-drying.
  5. Portions of the samples were sent to Dr. Randolph Chambers for measuring the sample’s organic content.
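The oven-drying measurement in step 4 reduces to simple arithmetic on the two weights. The sketch below shows one way to compute it; it is an illustration under assumptions, since the text above does not state whether moisture is reported relative to dry or wet mass, so both conventions are offered. The function name and the example masses are hypothetical and are not our measured data.

```python
def gravimetric_moisture(wet_mass_g: float, dry_mass_g: float, basis: str = "dry") -> float:
    """Return soil moisture content as a fraction of the chosen mass basis.

    wet_mass_g: sample mass before oven-drying
    dry_mass_g: sample mass after oven-drying
    basis: "dry" divides the water mass by the dry mass,
           "wet" divides the water mass by the wet mass
    """
    if dry_mass_g <= 0 or wet_mass_g < dry_mass_g:
        raise ValueError("expected wet_mass_g >= dry_mass_g > 0")
    water_mass_g = wet_mass_g - dry_mass_g
    if basis == "dry":
        return water_mass_g / dry_mass_g
    if basis == "wet":
        return water_mass_g / wet_mass_g
    raise ValueError("basis must be 'dry' or 'wet'")


if __name__ == "__main__":
    # Hypothetical masses: 25 g before drying, 20 g after drying.
    print(gravimetric_moisture(25.0, 20.0, basis="dry"))
    print(gravimetric_moisture(25.0, 20.0, basis="wet"))
```

Whichever basis is chosen, it should be the same for every sample so that the moisture values fed to the model are comparable.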


  • Please download our protocol for soil sample collection here: Soil Sample Collection Protocol.
  • Please download our metadata from soil sample collection here: Soil Sample Collection Metadata.
3. DNA Soil Extraction



    1. Sieve the soil.
    2. Extract two replicates for each soil sample: one using the DNeasy Soil Extraction Kit and one using the newer DNeasy® PowerSoil® Pro Kit.
    3. Confirm the extracted DNA by Nanodrop and gel.

    Please download our protocols for DNA soil extraction here:

4. 16S PCR



    1. Two settings were used (see Table 1 below)
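    2. PCR settings:

      Source                                           | Initial Denaturation | Denaturation               | Primer Annealing | Extension     | Final Extension
      -------------------------------------------------|----------------------|----------------------------|------------------|---------------|----------------
      Settings from Sinclair, Bertilsson, & Eiler 2015 | 95°C for 5 s         | 20 cycles at 95°C for 40 s | 53°C for 40 s    | 72°C for 60 s | 72°C for 7 min
      Modified Settings                                | 98°C for 30 s        | 20 cycles at 98°C for 10 s | 53°C for 40 s    | 72°C for 60 s | 72°C for 7 min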

      Table 1: Top row contains PCR settings from Sinclair et al. 2015; bottom row contains PCR settings modified from Sinclair et al. 2015.

    3. PCR Purification with NEB Monarch Kit.
    4. Purified PCR products were sent for sequencing. We sent a total of 15 samples: 12 distinct samples plus three duplicates (samples 1, 6, and 10 were each sequenced twice).
    5. Please download our protocol for 16S PCR here: PCR for 16S rRNA Sequencing Protocol
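To make the difference between the two cycling programs in Table 1 concrete, they can be written down as data and compared. This is a minimal sketch, assuming that the 20 cycles apply to the denaturation, annealing, and extension steps together (the table lists the cycle count only in the denaturation column). The `Step` and `PcrProgram` types are hypothetical names introduced for illustration, and the computed time ignores temperature ramping.

```python
from dataclasses import dataclass
from typing import NamedTuple


class Step(NamedTuple):
    temp_c: float  # hold temperature in degrees Celsius
    seconds: int   # hold duration in seconds


@dataclass(frozen=True)
class PcrProgram:
    name: str
    initial_denaturation: Step
    denaturation: Step
    annealing: Step
    extension: Step
    final_extension: Step
    cycles: int = 20  # assumption: cycles cover denaturation, annealing, extension

    def hold_seconds(self) -> int:
        """Total programmed hold time, ignoring ramping between temperatures."""
        per_cycle = (self.denaturation.seconds
                     + self.annealing.seconds
                     + self.extension.seconds)
        return (self.initial_denaturation.seconds
                + self.cycles * per_cycle
                + self.final_extension.seconds)


# Values transcribed from Table 1.
SINCLAIR = PcrProgram("Sinclair et al. 2015",
                      Step(95, 5), Step(95, 40), Step(53, 40),
                      Step(72, 60), Step(72, 7 * 60))
MODIFIED = PcrProgram("Modified",
                      Step(98, 30), Step(98, 10), Step(53, 40),
                      Step(72, 60), Step(72, 7 * 60))

if __name__ == "__main__":
    for program in (SINCLAIR, MODIFIED):
        print(f"{program.name}: {program.hold_seconds()} s of programmed holds")
```

Under these assumptions the two programs differ only in the initial denaturation and per-cycle denaturation steps; annealing, extension, and final extension are identical.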

5. Software Validation



    To validate the accuracy of the chassEASE software package, we first used the standard process to generate a metagenomic subset of QIITA containing the parameters we had collected: metagenome_type, sample_time, season, latitude, longitude, depth, elevation, biome_class, land_use, moisture, humidity, soil_temperature, environment_temperature, ph, total_carbon, total_nitrogen, total_phosphorus, and carbon_nitrogen_ratio.

    To determine whether chassEASE predictions are more effective than random chance, we first used the QIITA database to create probability distributions of the relative abundance (RA) for each genus and species. These distributions can be used to determine the probability that any individual predicted RA value for a species or genus would be generated by random chance.

    For each genus and species value in our proof of concept samples, we plotted the ANN/AI prediction and the true relative abundance value determined using 16S sequencing on the corresponding probability distribution. The area under the distribution between these two points is the probability that the prediction would have been created by random chance. We ran this analysis for each genus and species value in each proof of concept sample, and averaged the results to obtain an average probability of error.
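The calculation described above can be sketched in a few lines. This is a minimal illustration under assumptions, not the chassEASE implementation: the text does not say how the probability distributions are estimated, so the sketch uses the empirical distribution of reference RA values and takes the "area between the points" to be the fraction of reference values lying between the prediction and the measured value. The function names and all numbers are hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import Iterable, Sequence, Tuple


def chance_probability(reference: Sequence[float], predicted: float, observed: float) -> float:
    """Fraction of reference RA values lying between prediction and observation.

    Under the empirical distribution of `reference`, this is the probability
    mass between the two points (both endpoints included).
    """
    values = sorted(reference)
    if not values:
        raise ValueError("reference distribution is empty")
    low, high = sorted((predicted, observed))
    inside = bisect_right(values, high) - bisect_left(values, low)
    return inside / len(values)


def average_probability(cases: Iterable[Tuple[Sequence[float], float, float]]) -> float:
    """Average the per-taxon probabilities over (reference, predicted, observed) cases."""
    probabilities = [chance_probability(ref, pred, obs) for ref, pred, obs in cases]
    if not probabilities:
        raise ValueError("no cases supplied")
    return sum(probabilities) / len(probabilities)


if __name__ == "__main__":
    # Hypothetical reference RA values for one taxon (not real QIITA data).
    reference = [0.00, 0.01, 0.02, 0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60]
    close = chance_probability(reference, predicted=0.04, observed=0.12)
    far = chance_probability(reference, predicted=0.50, observed=0.00)
    print(close, far)
    print(average_probability([(reference, 0.04, 0.12), (reference, 0.50, 0.00)]))
```

A prediction close to the measured value spans little of the distribution and so gets a small probability, while a prediction far from it spans most of the distribution; averaging over every taxon and sample gives a single summary figure.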

Results:


    Across all of the species present in our 15 sequenced samples, the average probability of error is 28.3%. For genus-level predictions, the average probability of error is 15.9%. These results indicate that our software is predicting the presence of these bacterial species and genera as opposed to simply guessing.

    If you are interested in the breakdown of our results by individual species and genera, please see our attached spreadsheets:

    The goal of our proof of concept was to determine if chassEASE is able to draw predictions based on environmental input data. Through this process, our team demonstrated that our software is able to accomplish this goal.


    If you would like to view the individual 16S relative abundance results for each of our soil samples, you can access these CSV files at the links below:

    References



    • Illumina. (n.d.). 16S Metagenomic Sequencing Library Preparation. Illumina.com. Retrieved October 9, 2022, from https://support.illumina.com/documents/documentation/chemistry_documentation/16s/16s-metagenomic-library-prep-guide-15044223-b.pdf
    • Orwin, K.H., Dickie, I.A., Holdaway, R., & Wood, J.R. (2018). A comparison of the ability of PLFA and 16S rRNA gene metabarcoding to resolve soil community change and predict ecosystem functions. Soil Biology and Biochemistry, 117, 27-35. https://doi.org/10.1016/j.soilbio.2017.10.036
    • Sinclair, L., Osman, O. A., Bertilsson, S., & Eiler, A. (2015). Microbial community composition and diversity via 16S rRNA gene amplicons: evaluating the Illumina platform. PLoS ONE, 10(2), e0116955. https://doi.org/10.1371/journal.pone.0116955
    • Thijs, S., Op De Beeck, M., Beckers, B., Truyens, S., Stevens, V., Van Hamme, J. D., Weyens, N., & Vangronsveld, J. (2017). Comparative Evaluation of Four Bacteria-Specific Primer Pairs for 16S rRNA Gene Surveys. Frontiers in Microbiology, 8, 494. https://doi.org/10.3389/fmicb.2017.00494
    • Wang, F., Men, X., Zhang, G., Liang, K., Xin, Y., Wang, J., Li, A., Zhang, H., Liu, H., & Wu, L. (2018). Assessment of 16S rRNA gene primers for studying bacterial community structure and function of aging flue-cured tobaccos. AMB Express, 8(1), 182. https://doi.org/10.1186/s13568-018-0713-1
    • Xia, X., Zhang, P., He, L., Gao, X., Li, W., Zhou, Y., Li, Z., Li, H., & Yang, L. (2019). Effects of tillage managements and maize straw returning on soil microbiome using 16S rDNA sequencing. Journal of Integrative Plant Biology, 61(6), 765–777. https://doi.org/10.1111/jipb.12802