
Results

Introduction

Raw sequence analysis was performed using mothur version 1.44.1 [1], with the SILVA 138 database used for alignment and taxonomic classification [2]. Mothur is an open-source software package for bioinformatics data processing that is frequently used in the analysis of DNA from uncultured microbes. SILVA provides comprehensive, quality-checked and regularly updated datasets of aligned small (16S/18S, SSU) and large subunit (23S/28S, LSU) ribosomal RNA (rRNA) sequences for all three domains of life (Bacteria, Archaea and Eukarya), as well as a suite of search, primer-design and alignment tools.

Sequences were clustered at a 0.97 identity threshold to form the final Operational Taxonomic Units (OTUs), which were then used for statistical analysis. In 16S metagenomics approaches, an OTU is used to classify bacteria based on sequence similarity of the 16S marker gene. Typically, OTU clusters are defined by a 97% identity threshold of the 16S gene sequences, a cutoff conventionally used as an approximation of species-level groupings.

Microbial diversity results were obtained for all samples analysed. Based on Good's coverage, the majority of the bacterial diversity was captured (Table 1 - Appendix). Good's coverage estimates what percentage of the total species is represented in a sample (i.e. a value of 100% implies that all of the existing species have been recovered). Shannon diversity index values were ~5.5 in all samples, apart from B8 (4.1), B10 (6.2) and B12 (6.0) (Table 1 - Appendix). The Shannon diversity index is a popular metric in ecology that describes how diverse the species in a given community are; it rises with the number of species and the evenness of their abundance.
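Both indices can be computed directly from the OTU read counts of a sample. The following is a minimal Python sketch rather than the mothur implementation used in this study: it assumes the common definition of Good's coverage (one minus the fraction of reads that belong to singleton OTUs) and a natural-log Shannon index. The read counts are hypothetical and are not taken from the appendix.

```python
import math


def goods_coverage(counts):
    """Good's coverage: 1 - (number of singleton OTUs / total reads)."""
    total = sum(counts)
    singletons = sum(1 for c in counts if c == 1)
    return 1.0 - singletons / total


def shannon_index(counts):
    """Shannon index H' = -sum(p_i * ln(p_i)) over OTUs with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Hypothetical OTU read counts for one sample (illustration only).
example_counts = [50, 30, 10, 5, 3, 1, 1]

coverage = goods_coverage(example_counts)   # 2 singletons out of 100 reads
shannon = shannon_index(example_counts)     # uneven community
even = shannon_index([25, 25, 25, 25])      # perfectly even, 4 OTUs -> ln(4)
```

For a fixed number of OTUs the index is maximal when all OTUs are equally abundant, which is why the uneven seven-OTU example stays below ln(7).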

Based on the relative abundance of the detected bacterial species, those with abundant (>10%) and common (1-10%) occurrence (Appendix) are most likely to represent active members of the bacterial communities in each sample (Fig. 1 - Appendix). Overall, 3,046 species were detected across all samples. Amongst them, only 82 species were classified as common and/or abundant (Appendix).
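The abundance classes above can be expressed as a simple thresholding step on relative abundances. The sketch below is illustrative only: the function name, the "rare" label for OTUs below 1%, and the read counts are assumptions introduced here, not part of the original analysis.

```python
def classify_otus(counts, common=0.01, abundant=0.10):
    """Label each OTU by its relative abundance within one sample.

    abundant: > 10% of reads, common: 1-10%, rare: < 1% (label assumed here).
    """
    total = sum(counts.values())
    labels = {}
    for otu, reads in counts.items():
        rel = reads / total
        if rel > abundant:
            labels[otu] = "abundant"
        elif rel >= common:
            labels[otu] = "common"
        else:
            labels[otu] = "rare"
    return labels


# Hypothetical read counts for five OTUs in a single sample.
sample_counts = {"OTU1": 600, "OTU2": 250, "OTU3": 90, "OTU4": 50, "OTU5": 5}
labels = classify_otus(sample_counts)
kept = sorted(otu for otu, lab in labels.items() if lab != "rare")
```

Here `kept` holds the OTUs that would be carried forward as common and/or abundant for that sample.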

The appendix contains the GenBank ID of the closest relative (column C), its taxonomy (column D), the percentage of similarity (column E), which is relatively high throughout, and the number of samples (column F). Columns G-R correspond to the presence of each OTU in the given sample. GenBank is a comprehensive database that contains publicly available nucleotide sequences for more than 300,000 organisms named at the genus level or lower, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects.

Measurement of 16S rRNA copies using bacterial primers can provide an indication of total bacterial cell numbers. In this case, qPCR results indicated fluctuating bacterial cell concentrations ranging from 40.4 copies/mg to 1.6 × 10^6 copies/mg (Fig. 1 - Appendix). Overall, concentrations were much higher in samples B1-B5, exceeding those in the rest of the samples by ~2-3 orders of magnitude (Fig. 1 - Appendix).

Morisita and Bray-Curtis similarities calculated between all samples showed only one cluster (B1-B5) with intracluster similarities >70% and >55% respectively. Sample B6 was an outgroup of this cluster in both cases, while sample B8 exhibited the lowest similarities with all samples using both similarity indices (Fig. 2 - Appendix). Similar results were observed when OTU abundance data were analyzed with nMDS (Fig. 3 - Appendix).
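For readers unfamiliar with the two similarity measures, the following Python sketch shows how they can be computed for a pair of samples from OTU count vectors. It assumes the count-based Bray-Curtis similarity and Morisita's original overlap index; the exact variant used in this study (for example the Morisita-Horn form) is not stated in the text, and the count vectors are hypothetical.

```python
def bray_curtis_similarity(x, y):
    """Bray-Curtis similarity: 2 * sum(min(x_i, y_i)) / (sum(x) + sum(y))."""
    shared = sum(min(a, b) for a, b in zip(x, y))
    return 2.0 * shared / (sum(x) + sum(y))


def morisita_similarity(x, y):
    """Morisita overlap index based on Simpson-type dominance terms."""
    total_x, total_y = sum(x), sum(y)
    dom_x = sum(a * (a - 1) for a in x) / (total_x * (total_x - 1))
    dom_y = sum(b * (b - 1) for b in y) / (total_y * (total_y - 1))
    cross = sum(a * b for a, b in zip(x, y))
    return 2.0 * cross / ((dom_x + dom_y) * total_x * total_y)


# Hypothetical OTU counts (same OTU order in every vector).
sample_a = [10, 20, 30]
sample_b = [30, 20, 10]   # same OTUs as sample_a, different proportions
sample_c = [0, 0, 0, 8]   # shares no OTU with the padded sample_a below

bc_ab = bray_curtis_similarity(sample_a, sample_b)
mo_ab = morisita_similarity(sample_a, sample_b)
bc_ac = bray_curtis_similarity(sample_a + [0], sample_c)
mo_ac = morisita_similarity(sample_a + [0], sample_c)
```

Samples that share no OTUs score 0 on both indices, while samples with the same OTUs in different proportions score somewhere in between; a full analysis would compute these values for every pair of samples before clustering.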

Taxonomy

Phylum-Class Level

Taxonomic analysis at the Phylum-Class (only for Proteobacteria) level showed that the three most abundant Phyla-Classes in samples B1-B5 were Actinobacteria, Bacteroidetes and Alphaproteobacteria showing fluctuating abundances usually followed by Gammaproteobacteria, apart from B5, where the latter group slightly exceeded Alphaproteobacteria. In B6 Gammaproteobacteria was the most abundant group followed by the other three prevalent groups (Fig. 4 - Appendix).

Overall, these four groups were the most abundant in all samples, with fluctuating abundances. The only exceptions were observed in B3 and B4, where Acidobacteria and Planctomycetes exceeded Gammaproteobacteria; in B7, which was dominated by Acidobacteria, followed by the other four major groups; and in B10, where Acidobacteria were the third most abundant group (Fig. 4 - Appendix).

The most striking differences to this pattern were observed in B8, where Firmicutes appeared in relative abundances ~15% while they had never exceeded 8% in any other sample, and in B11 where members of Patescibacteria reached 19% while they were below 5% in the rest of the samples, and below 1% in B1-B6 (Fig. 4 - Appendix).

OTU Level

At the OTU level, the taxonomy of common (>1%) and abundant (>10%) species agreed with the major pattern observed for Phyla/Classes. The most abundant species in B1 were members of Bacteroidetes, more specifically representatives of Adhaeribacter and Pontibacter; members of Actinobacteria such as uncultured Actinomarinales, Modestobacter, Micrococcus and Rubrobacter; members of Alphaproteobacteria such as Skermanella and Sphingomonas; and Massilia as the Gammaproteobacteria representative. Not surprisingly, as suggested by the similarity analysis, the majority of these species were common/abundant in samples B2-B6 as well, with the uncultured Tepidisphaerales; WD2101 soil group (Planctomycetes) and Pseudomonas also emerging as abundant in samples B3 and B6 respectively (Fig. 5 - Appendix).

Amongst the common/abundant species in B7, only three were shared with the abundant species of the B1-B6 group, including Sphingomonas and Modestobacter. Other common/abundant species, belonging to Cutibacterium, were shared only with B6. Apart from the species shared with previous stages, several B7 common/abundant species such as Streptococcus, Methylobacterium and Amnibacterium (Actinobacteria) emerged here and increased in further steps, while others, such as representatives of Vicinamibacteraceae (Acidobacteria), Paracoccus and Acinetobacter, were prevalent only in this sample.

As expected from the low similarities observed between B8 and the rest of the samples, B8 shared only five common/abundant species with other samples; these belonged to Amnibacterium, Cutibacterium, Streptococcus, Staphylococcus and Methylobacterium. Notably, no shared species were detected with B1-B5 (Fig. 5 - Appendix). Amongst the remaining non-shared species, the most abundant belonged to Novosphingobium (Alphaproteobacteria; Sphingomonadales) and to Caldilineaceae (Chloroflexi).

Amongst the dominant B9 species, Methylobacterium, Staphylococcus, Amnibacterium, Vicinamibacteraceae and Streptococcus were shared with B7 and/or B8, while the rest, which belonged either to Microscillaceae (Bacteroidetes) or to Sandaracinaceae (Myxococcota), were unique to this sample. Sample B10, which exhibited the highest Shannon diversity, shared many common/abundant species with other samples, the most abundant belonging to Methylobacterium, Modestobacter and Skermanella (Fig. 5 - Appendix).

Sample B11, although it shared Sphingomonadaceae representatives, Modestobacter and Skermanella with other samples, was quite distinct, since it was dominated by representatives of the uncultured groups Patescibacteria; Parcubacteria and Planctomycetes; OM190, which were completely absent from the rest of the samples. Finally, sample B12 shared Cutibacterium, Staphylococcus and Amnibacterium species with previous samples, while it also contained unique common species that belonged to Microscillaceae (Bacteroidetes) and Patescibacteria; Parcubacteria.

Overall, most samples appear to differ strikingly from one another. However, the group B1-B6 was characterized by higher similarities and by a group of 'core' bacteria that were common/abundant only in these samples and negligible in all the rest. This group consisted of Adhaeribacter, Sphingomonas, Pontibacter, Massilia and Rubrobacter species (Fig. 5 - Appendix).

Potential Functions

Comparison of our taxonomic results with the literature suggests that the core microbiota of the B1-B6 group mainly prevail in the rhizosphere of arid soils and possess functions that include the production of melanin, the ability to retain water (Adhaeribacter), the production of exopolysaccharides (Adhaeribacter, Sphingomonas) and the important genes for N2 fixation (Sphingomonas, Pontibacter). Some species (Sphingomonas) are beneficial for plants, inducing the production of auxin and protecting the plants from phytopathogens and abiotic stress [3]. Other species (Sphingomonas, Massilia) can also degrade complex organic compounds such as aromatic compounds, chitin and phenolic compounds [4], while Rubrobacterales are abundant in sunlight-exposed biofilms and in irradiated areas [5].

Genera that prevailed mostly in the samples after B6 included Methylobacterium, Amnibacterium, Staphylococcus and Cutibacterium. These genera can be found in soil, since they are ubiquitous in several habitats. The latter two in particular are widespread in the environment and are commonly considered potential contaminants. The former two are commonly detected in soil but are also encountered in animal guts. Methylobacterium can be symbiotic and grows on reduced one-carbon compounds, and it may stimulate plant development through the production of phytohormones. Amnibacterium is found in soil but has also been detected as an endophyte [6].

Regarding genera that prevailed in the majority of samples, such as Skermanella and Modestobacter, the genus Skermanella is very common in soil diazotrophic communities, although some studies have shown that some species do not possess genes for nitrogen fixation [7]. Modestobacter includes strains with genes associated with stress response, including osmotic stress, resistance to UV radiation, temperature and carbon starvation [8].

Groups of bacteria that did not exhibit any specific pattern and prevailed in some or only one sample included uncultured Vicinamibacterales (Acidobacteria), which are very common in soil and usually grow on sugars, proteins or complex organic compounds [9], implying conditions that could support their growth in these samples, and uncultured Microscillaceae, which have been found in several soils but are not so common in the rhizosphere [10].

Overall, samples B1-B12 appear to exhibit very high bacterial diversity that followed different community patterns encountered in different soils. Apart from the different diazotrophic bacteria that were prevalent in different samples, another commonly encountered characteristic was the potential for UV protection, which appears in close-to-surface communities.

Table I. Plot number, corresponding mix used for soil impregnation (carrier and strain) and yield (g).
Plot | Carrier | Strain | Yield (g)
B1 | Zeolite | Bacillus amyloliquefaciens, subgroup B. subtilis strain RS-3 | 3500
B2 | Liquid | B. thuringiensis, subgroup B. cereus strain 109/18 | 4400
B3 | Liquid | B. subtilis, subgroup B. subtilis strain 548 | 4500
B4 | Biochar | Control | 4200
B5 | Zeolite | Control | 4250
B6 | Liquid | B. subtilis, subgroup B. subtilis strain Z3 | 5700
B7 | Biochar | Bacillus mojavensis, subgroup B. subtilis strain 5B2 | 4500
B8 | Zeolite | Bacillus mojavensis, subgroup B. subtilis strain 5B2 | 3150
B9 | Zeolite | Bacillus subtilis, subgroup B. subtilis strain 557 | 4200
B10 | Biochar | Bacillus amyloliquefaciens, subgroup B. subtilis strain RS-3 | 5650
B11 | Liquid | Bacillus amyloliquefaciens, subgroup B. subtilis strain RS-3 | 2950
B12 | Liquid | Bacillus subtilis, subgroup B. subtilis strain 557 | 3100

Machine Learning Algorithm

Using the metagenomic data above on the bacterial communities found in the soil and the range of yield, together with the weather data and soil analysis, we developed a dataset that was used for training our machine learning model. The dataset first goes through a preprocessing procedure that includes removing NAs and reformatting values. Then, correlations are extracted using statistical methods (Pearson, Spearman and Kendall). This step helps identify the most informative variables in the dataset. The next step is to test 17 different classifiers in order to find the best one in terms of accuracy. Lastly, we find the optimal hyperparameters for the selected algorithm and train the model.
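The preprocessing and correlation steps can be illustrated with a small, standard-library-only Python sketch. This is not the R code used in the study: the column names `feature` and `target`, the rows, and the choice of Kendall's tau-b (which corrects for ties) are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def average_ranks(values):
    """Ranks starting at 1; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b from concordant, discordant and tied pairs."""
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            ties_x += dx == 0
            ties_y += dy == 0
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    pairs = n * (n - 1) / 2
    return (concordant - discordant) / math.sqrt((pairs - ties_x) * (pairs - ties_y))


# Hypothetical rows; None plays the role of NA.
rows = [
    {"feature": 1, "target": 1}, {"feature": 2, "target": 4},
    {"feature": 3, "target": 9}, {"feature": None, "target": 7},
    {"feature": 4, "target": 16}, {"feature": 5, "target": 25},
]
complete = [r for r in rows if None not in r.values()]  # drop rows with NA
xs = [r["feature"] for r in complete]
ys = [r["target"] for r in complete]
r_pearson, r_spearman, r_kendall = pearson(xs, ys), spearman(xs, ys), kendall_tau_b(xs, ys)
```

Because the toy relationship is monotonic but not linear, the two rank-based coefficients equal 1 while Pearson's coefficient is slightly lower, which is one reason for inspecting all three.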

The algorithm that exports the final report was developed in R version 4.2.0 (2022-04-22). R is a free software environment for statistical computing and graphics that compiles and runs on a wide variety of UNIX platforms, Windows and macOS, and it made the dataset easy to handle. First, the dataset was restructured so that each microorganism type became the name of a variable holding the ratios of that microorganism. This reduced the number of variables and made the dataset easier to understand. Next, rows that included NAs were removed. Correlations between the variables were then extracted using the Pearson, Spearman and Kendall statistical methods, which can also be used to reduce the number of variables and make the model faster and more accurate.

Then a folder was created to save the results, and the variable to be predicted was converted into a factor. Factor variables are categorical variables whose values can be either numeric or string. After this conversion, the variable can be passed to a classifier to make the prediction. The built-in R function as.factor(), which converts a column to a factor, was used for that purpose.

Finally, 17 different classifiers were tested in order to find the best one in terms of speed and accuracy, using the caret package for R. The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process of creating predictive models. The package contains tools for data splitting, pre-processing, feature selection, model tuning using resampling, and variable importance estimation, as well as other functionality. The 17 classifiers used are presented in Table II.

Table II. List of the classifiers tested.
Model | Method | Type | Libraries | Tuning Parameters
Bagged AdaBoost | AdaBag | Classification | adabag, plyr | mfinal, maxdepth
Model Averaged Neural Network | avNNet | Classification, Regression | nnet | size, decay, bag
Stacked AutoEncoder Deep Neural Network | dnn | Classification, Regression | deepnet | layer1, layer2, layer3, hidden_dropout, visible_dropout
CART | rpart | Classification, Regression | rpart | cp
k-Nearest Neighbors | kknn | Classification, Regression | kknn | kmax, distance, kernel
k-Nearest Neighbors | knn | Classification, Regression | knn | k
Logistic Model Trees | LMT | Classification | RWeka | iter
glmnet | glmnet | Classification, Regression | glmnet, Matrix | alpha, lambda
Penalized Multinomial Regression | multinom | Classification | nnet | decay
Naive Bayes | naive_bayes | Classification | naivebayes | laplace, usekernel, adjust
Optimal Weighted Nearest Neighbor Classifier | ownn | Classification | snn | K
Penalized Discriminant Analysis | pda | Classification | mda | lambda
Partial Least Squares | pls | Classification, Regression | pls | ncomp
Random Forest | rf | Classification, Regression | randomForest | mtry
Support Vector Machines with Linear Kernel | svmLinear | Classification, Regression | kernlab | C
Support Vector Machines with Radial Basis Function Kernel | svmRadial | Classification, Regression | kernlab | sigma, C
eXtreme Gradient Boosting | xgbTree | Classification, Regression | xgboost, plyr | nrounds, max_depth, eta, gamma, colsample_bytree, min_child_weight, subsample
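The comparison of candidate classifiers can be pictured as a loop that fits each candidate under the same cross-validation scheme and keeps the most accurate one. The Python sketch below only illustrates that idea: the three toy classifiers stand in for the 17 caret models, and the data, fold count and function names are hypothetical.

```python
import random
from collections import Counter


def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def fit_majority(train_x, train_y):
    """Baseline: always predict the most frequent training label."""
    label = Counter(train_y).most_common(1)[0][0]
    return lambda row: label


def fit_nearest_centroid(train_x, train_y):
    """Predict the label whose class mean is closest to the row."""
    centroids = {}
    for label in sorted(set(train_y)):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return lambda row: min(centroids, key=lambda lab: sq_dist(centroids[lab], row))


def fit_nearest_neighbor(train_x, train_y):
    """1-nearest-neighbour: copy the label of the closest training row."""
    def predict(row):
        best = min(range(len(train_x)), key=lambda i: sq_dist(train_x[i], row))
        return train_y[best]
    return predict


def cv_accuracy(fit, xs, ys, folds=3, seed=42):
    """Mean accuracy over k folds, using a seeded shuffle for repeatability."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for f in range(folds):
        test_idx = idx[f::folds]
        train_idx = [i for i in idx if i not in test_idx]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        correct += sum(model(xs[i]) == ys[i] for i in test_idx)
    return correct / len(xs)


# Hypothetical, well-separated two-class data (not the study dataset).
xs = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (0.9, 1.1), (1.2, 1.0), (1.0, 0.8),
      (3.0, 3.1), (2.9, 3.0), (3.2, 2.8), (3.1, 3.2), (2.8, 2.9), (3.0, 3.3)]
ys = ["low"] * 6 + ["high"] * 6

candidates = {"majority": fit_majority,
              "nearest_centroid": fit_nearest_centroid,
              "nearest_neighbor": fit_nearest_neighbor}
results = {name: cv_accuracy(fit, xs, ys) for name, fit in candidates.items()}
best = max(results, key=results.get)
```

Using the same seeded folds for every candidate keeps the comparison fair; timing each fit in the same loop would cover the speed criterion mentioned in the text.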

After selecting the algorithm based on its accuracy and run time, we trained the model again using different hyperparameters in order to optimize it. Moreover, every time the dataset is updated, the algorithms will be tested again. In future work we aim to add more classifiers to test. To obtain the same results on each run, we also used the set.seed() function. This R function makes results reproducible when code involves creating variables that take on random values: by calling set.seed(), you guarantee that the same random values are produced each time you run the code.
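The effect of fixing a seed can be shown in a few lines. The sketch below uses Python's standard `random` module as an analogue of R's set.seed(); the helper name `draw` and the seed values are arbitrary choices made for illustration.

```python
import random


def draw(seed, n=5):
    """Return n pseudo-random numbers from a generator initialised with seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


first = draw(123)
second = draw(123)   # same seed, so the same sequence
other = draw(456)    # different seed, so a different sequence
```

Any step that relies on random numbers, such as data splitting or resampling, therefore produces identical output across runs as long as the seed and the code stay the same.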

Appendix

References
  1. Schloss, P.D., Westcott, S.L., Ryabin, T., Hall, J.R., Hartmann, M., Hollister, E.B., et al. (2009) Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities. Appl Environ Microbiol 75: 7537-7541.
  2. Pruesse, E., Quast, C., Knittel, K., Fuchs, B.M., Ludwig, W., Peplies, J., and Glöckner, F.O. (2007) SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res 35: 7188-7196.
  3. Li, X., Chu, Y., Jia, Y., Yue, H., Han, Z., and Wang, Y. (2022) Changes to bacterial communities and soil metabolites in an apple orchard as a legacy effect of different intercropping plants and soil management practices. Frontiers in Microbiology 13:.
  4. Xie, Z., Yu, Z., Li, Y., Wang, G., Liu, X., Tang, C., et al. (2022) Soil microbial metabolism on carbon and nitrogen transformation links the crop-residue contribution to soil organic carbon. npj Biofilms Microbiomes 8: 1-10.
  5. Chen, R.-W., He, Y.-Q., Cui, L.-Q., Li, C., Shi, S.-B., Long, L.-J., and Tian, X.-P. (2021) Diversity and Distribution of Uncultured and Cultured Gaiellales and Rubrobacterales in South China Sea Sediments. Frontiers in Microbiology 12:.
  6. Li, F.-N., Tuo, L., Lee, S.M.-Y., Jin, T., Liao, S., Li, W., et al. Amnibacterium endophyticum sp. nov., an endophytic actinobacterium isolated from Aegiceras corniculatum. International Journal of Systematic and Evolutionary Microbiology 68: 1327-1332.
  7. Gao, H., Li, S., and Wu, F. (2021) Impact of Intercropping on the Diazotrophic Community in the Soils of Continuous Cucumber Cropping Systems. Frontiers in Microbiology 12:.
  8. Zhang, Q., Araya, M.M., Astorga-Eló, M., Velasquez, G., Rilling, J.I., Campos, M., et al. (2022) Composition and Potential Functions of Rhizobacterial Communities in a Pioneer Plant from Andean Altiplano. Diversity 14: 14.
  9. Huber, K.J. and Overmann, J. (2018) Vicinamibacteraceae fam. nov., the first described family within the subdivision 6 Acidobacteria. International Journal of Systematic and Evolutionary Microbiology 68: 2331-2334.
  10. Company, J., Valiente, N., Fortesa, J., García-Comendador, J., Lucas-Borja, M.E., Ortega, R., et al. (2022) Secondary succession and parent material drive soil bacterial community composition in terraced abandoned olive groves from a Mediterranean hyper-humid mountainous area. Agriculture, Ecosystems & Environment 332: 107932.