Results | William_and_Mary

This page details the results we achieved with our software design:

1) Metagenomic data analysis: our complete GIS dataset.
2) Neural network: the accuracy our network achieved and how it was optimized during training.
3) KNN regression: how well our trained KNN model fits the validation data.
4) Random Forest regression: how well our trained Random Forest model fits the validation data.
5) Linear regression: an in-depth analysis of our dataset and of how well the linear model fits it.
6) GEM: how the GEM model successfully predicts metabolic behavior.
7) Gut microbiome: our completed database, ready for users.

To view our software toolkit, please visit the Gitlab: chassEASE.

1. Metagenomic Data Analysis Results

We created a large database from many sources, including bacterial data from three environments: air, water, and soil. Although the database itself does not include a model that produces a statistical output, it gives researchers the bacterial data they need to predict survival in their target environment. The complete dataset also serves as the foundation for the later parts of our software, such as our prediction models.

Snapshot of our dataset:

2. Neural Network Results

These results show the accuracy of our neural network on the training and validation data. After training on the training data, our model reached an accuracy of 71%. We used the Adam optimizer and trained for 1000 epochs, giving the model ample iterations to reduce its loss.

The following graph shows the training accuracy and validation accuracy of our neural network during the training process. The accuracy of our model increases over time. At the end of training, the validation accuracy is higher than the training accuracy, which suggests that our model is not overfitting.

The major advantage of our neural network model is the ability of artificial neural networks (ANNs) to learn and represent complex relationships. In reality, the relationships between the various inputs and outputs are interdependent and complex. Once trained, our neural network can not only predict bacterial reads but also generalize the patterns in the dataset.

Proof of Prediction

To validate the accuracy of the chassEASE software package, we first generated a metagenomic subset of QIITA using the standard process. The subset contained the parameters we had collected: metagenome_type, sample_time, season, latitude, longitude, depth, elevation, biome_class, land_use, moisture, humidity, soil_temperature, environment_temperature, ph, total_carbon, total_nitrogen, total_phosphorus, and carbon_nitrogen_ratio.

We then used our AI/Artificial Neural Network model to generate predicted relative abundance values of bacteria that were found in our 16S sequencing results. These predictions were evaluated by determining the likelihood that predictions with equal or higher accuracy would be generated by an “educated blind algorithm,” a hypothetical algorithm that is aware of overall trends in bacterial abundance but unaware of specific environmental conditions.

To model the probability of this hypothetical algorithm generating a certain value, we created a normal distribution of relative abundance for each target chassis (a bacterial genus or species) using the overall QIITA database, under the assumption that the probability of a blind algorithm creating a certain prediction would follow a normal curve.

Using these normal curves, we determined the probability that a blind algorithm would have made a prediction as good as or better than ours. We found the area under the curve between the chassEASE prediction and the real value, and added it to the area under the curve between the real value and a prediction of equal accuracy on the opposite side of the real value. This generates a measure of accuracy for every genus and species prediction, varying between 0 (perfectly matches real data) and 1 (indistinguishable from random guessing).

To create an overall evaluation, we then took the average of these accuracy measurements across each species and genus predicted for one of our samples.
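The scoring procedure described above can be sketched in a few lines. This is a minimal illustration rather than the chassEASE implementation: the function names, the tuple layout, and all numeric values are hypothetical, and the baseline mean and standard deviation are assumed to have been computed from the overall QIITA database beforehand.

```python
from statistics import NormalDist, mean


def blind_guess_score(predicted, observed, baseline_mean, baseline_sd):
    """Probability that a draw from the baseline normal curve lands at least
    as close to the observed value as the prediction did.

    0 means the prediction matched the observation exactly; values near 1
    mean the prediction is no better than a blind guess.
    """
    baseline = NormalDist(mu=baseline_mean, sigma=baseline_sd)
    error = abs(predicted - observed)
    # Area between (observed - error) and (observed + error): the stretch from
    # the prediction to the observed value plus its mirror image on the
    # opposite side of the observed value.
    return baseline.cdf(observed + error) - baseline.cdf(observed - error)


def sample_score(taxa):
    """Average the per-taxon scores for one sample."""
    return mean(blind_guess_score(*taxon) for taxon in taxa)


# Hypothetical values: (predicted, observed, baseline mean, baseline sd)
example_taxa = [
    (0.050, 0.050, 0.04, 0.02),  # exact match, so the score is 0
    (0.030, 0.045, 0.04, 0.02),  # off by 0.015
]

for taxon in example_taxa:
    print(round(blind_guess_score(*taxon), 4))
print(round(sample_score(example_taxa), 4))
```

Because the two areas are mirror images around the observed value, a single difference of two cumulative probabilities covers both of them.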

We calculated the likelihood that our software is randomly guessing its output for predictions of individual species and genera. Averaged across our 15 sequence samples and all bacterial genera present, there is a 15.9% chance that our software is randomly guessing. This value rises to 28.3% when predicting individual species. These results indicate that our software is predicting the presence of these bacteria rather than simply guessing.

To learn more about the components of our software, please see our Proof of Concept page.

3. K-Nearest Neighbor (KNN) Results

This section shows how our KNN regression model performs on the validation dataset. For KNN regression, accuracy is not expressed as a single percentage. Instead, several statistical error metrics are used to describe the accuracy of the predictions.

Our validation results show:

Pseudomonas: MAE = 0.024, RMSE = 0.073, MSE = 0.005, R-Squared = 0.192
Mycobacterium: MAE = 0.001, RMSE = 0.0026, MSE = 6.85e-6, R-Squared = 0.326
Bacillus: MAE = 0.009, RMSE = 0.027, MSE = 0.0007, R-Squared = 0.033

For Pseudomonas, the MAE of 0.024 means that the absolute prediction error for Pseudomonas abundance, averaged over all instances in the validation set, is 0.024. The MSE of 0.005 is the average of the squared prediction errors, and the RMSE of 0.073 is its square root, which gives the typical size of an error in the same units as the abundance. The R-Squared of 0.192 means that 19.2% of the variance in Pseudomonas abundance is explained by the predictors (independent variables).

The Mycobacterium and Bacillus metrics are read the same way. For Mycobacterium, the mean absolute error is 0.001, the RMSE is 0.0026 (MSE = 6.85e-6), and 32.6% of the variance in abundance is explained by the predictors. For Bacillus, the mean absolute error is 0.009, the RMSE is 0.027 (MSE = 0.0007), and only 3.3% of the variance in abundance is explained by the predictors.
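For readers who want to reproduce these four metrics, the sketch below computes them from paired observed and predicted values using their standard definitions. The function name and the example numbers are hypothetical; they are not our validation data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, and R-Squared for paired observations/predictions."""
    n = len(y_true)
    errors = [pred - true for true, pred in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e * e for e in errors) / n    # mean squared error
    rmse = math.sqrt(mse)                   # root of the MSE, same units as y

    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total variation
    r2 = 1.0 - ss_res / ss_tot

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Hypothetical relative abundances, for illustration only
observed = [0.10, 0.20, 0.30, 0.40]
predicted = [0.12, 0.18, 0.33, 0.37]

for name, value in regression_metrics(observed, predicted).items():
    print(f"{name} = {value:.5f}")
```

Note that the RMSE is always the square root of the MSE, which is why the two move together in the tables on this page.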

KNN can estimate a value without expecting large error. It is a relatively mature method that can be used for regression. Its training time complexity is lower than that of many other supervised learning algorithms, and, unlike algorithms such as Naive Bayes, it makes no assumptions about the data. Moreover, it can provide high accuracy and is less sensitive to outliers, because a KNN prediction depends mainly on the limited number of neighboring samples around the data point being predicted rather than on discriminating class domains to determine a category.
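To make the neighbor-based idea concrete, here is a minimal KNN regression sketch in plain Python. It is an illustration, not our trained model: the two features (pH and soil temperature), the abundance values, and the choice of k = 3 are all assumptions made for the example.

```python
import math


def knn_predict(train_X, train_y, query, k=3):
    """Predict a value as the mean target of the k nearest training points."""
    # Pair each training target with its Euclidean distance to the query.
    distances = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )
    nearest = distances[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Hypothetical training data: (ph, soil_temperature) -> relative abundance.
# In practice the features would be scaled first so that no single unit
# dominates the distance.
train_X = [(6.5, 20.0), (6.8, 22.0), (7.0, 21.0), (5.0, 10.0)]
train_y = [0.02, 0.03, 0.04, 0.20]

prediction = knn_predict(train_X, train_y, query=(6.7, 21.0), k=3)
print(round(prediction, 4))
```

The fourth training point lies far from the query and has an unusually high abundance, yet it does not influence the prediction, because only the three nearest neighbors are averaged.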

4. Random Forest Results

This section shows how our Random Forest regression model performs on the validation dataset. For Random Forest regression, accuracy is not expressed as a single percentage. Instead, several statistical error metrics are used to describe how good the predictions are.

Pseudomonas: MAE = 0.024, RMSE = 0.071, MSE = 0.005, R-Squared = 0.243
Mycobacterium: MAE = 0.001, RMSE = 0.0025, MSE = 6.35e-6, R-Squared = 0.376
Bacillus: MAE = 0.006, RMSE = 0.023, MSE = 0.0005, R-Squared = 0.302

For Pseudomonas, the MAE of 0.024 means that the absolute prediction error for Pseudomonas abundance, averaged over all instances in the validation set, is 0.024. The MSE of 0.005 is the average of the squared prediction errors, and the RMSE of 0.071 is its square root, which gives the typical size of an error in the same units as the abundance. The R-Squared of 0.243 means that 24.3% of the variance in Pseudomonas abundance is explained by the predictors (independent variables).

The Mycobacterium and Bacillus metrics are read the same way. For Mycobacterium, the mean absolute error is 0.001, the RMSE is 0.0025 (MSE = 6.35e-6), and 37.6% of the variance in abundance is explained by the predictors. For Bacillus, the mean absolute error is 0.006, the RMSE is 0.023 (MSE = 0.0005), and 30.2% of the variance in abundance is explained by the predictors.

Random Forest performs well on many current datasets and has advantages over many other algorithms. It can handle very high-dimensional data without feature selection, because feature subsets are selected at random. After training, it can report which features are most important. While the forest is being built, an unbiased estimate of the generalization error is obtained, and the model generalizes well. Training is fast and easy to parallelize, because the trees are independent of each other during training. For an imbalanced dataset like ours, it can balance the error. Accuracy can still be maintained even if a significant portion of the features is missing.

5. Regression Results

After collecting 16S data from soil samples gathered by our team, we began testing our regression model for predicting Nitrospira with the 16S data. We built a linear regression model using our 16S soil data as training data, which gave an R-Squared of 0.91, indicating a close fit to the training data. Yet when we tested the model on a subset of the GIS data, we found that the accuracy was low. The model trained on the 16S dataset is therefore overfit. Here are the regressions side by side.

We created linear regression models for the first 50 bacteria sequenced, and gained the following results:

This is yet another example of how a regression needs data that is not only good but also plentiful. Without both quality and quantity of data, we won't be able to draw complete and statistically relevant conclusions.

Lastly, we created regressions for the first 50 genera and recorded their R^2 values. The plot above is meant to show that with accurate measurements our regression accuracy improves, but without more metadata the regressions are still incomplete. We hope that as these values increase, the mean adjusted R^2 and the overall predictability will increase as well.
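The adjusted R^2 mentioned above penalizes a fit for using many predictors relative to the number of samples. The sketch below applies the standard formula, adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1). The function name is hypothetical, and the predictor count of 5 is an assumption chosen only for illustration; it is not the number of predictors in our models.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjust R^2 for the number of predictors relative to the sample size."""
    degrees_of_freedom = n_samples - n_predictors - 1
    if degrees_of_freedom <= 0:
        raise ValueError("need more samples than predictors + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / degrees_of_freedom


# Same raw R^2, different amounts of data (predictor count is hypothetical)
print(round(adjusted_r2(0.91, n_samples=15, n_predictors=5), 3))
print(round(adjusted_r2(0.91, n_samples=1000, n_predictors=5), 3))
```

With only a handful of samples the adjusted value drops noticeably below the raw R^2, while with many samples the two nearly coincide, which is one way to see why more data with metadata matters.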

6. GEM Results

The primary results of our GEMs came from our collaboration with Gaston Day School (GDS). We were tasked with modeling their E. coli circuit and with modeling whether L-phenylalanine was a limiting factor in the production of cinnamaldehyde. We also wanted to provide them with interactive software to simulate these limiting factors under various media conditions. We chose GEMs as our model for GDS because we could add their engineered reactions and use Flux Balance Analysis (FBA) to predict growth under limited media conditions. The primary output was a Google Colaboratory notebook that allowed GDS students to change media conditions, add and delete reactions, and test the growth rate of any bacterium with a GEM. The general code is as follows:

To view our full code for the Gaston Day School Collab, please visit the Gitlab here.

7. Gut Microbiome Results

We collected metadata for a total of 353 projects, containing the 71,642 runs (samples) available in GMrepo. We eliminated samples with no genus or species information, leaving about 37,000 samples. Each sample contains the person's age, BMI, and country of origin, along with the relative abundance of the genera and species in their gut. We used a self-developed data scraper to collect the data from all 353 projects and designed algorithms that take human conditions such as country, age, and BMI as input and return a ranking of the most dominant species and genera among the samples that meet the given conditions.
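The filter-and-rank step can be sketched as follows. This is an illustration under stated assumptions, not our actual algorithm: the record layout, the field names, and the use of mean relative abundance as the measure of dominance are all hypothetical, as are the sample values.

```python
from collections import defaultdict


def rank_taxa(samples, country, age_range, bmi_range, top_n=3):
    """Rank taxa by mean relative abundance among samples matching the filters."""
    totals = defaultdict(float)
    matched = 0
    for sample in samples:
        if (sample["country"] == country
                and age_range[0] <= sample["age"] <= age_range[1]
                and bmi_range[0] <= sample["bmi"] <= bmi_range[1]):
            matched += 1
            for taxon, abundance in sample["abundance"].items():
                totals[taxon] += abundance
    if matched == 0:
        return []
    # A taxon missing from a matching sample counts as zero abundance there.
    means = {taxon: total / matched for taxon, total in totals.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical records, for illustration only (abundance in percent)
samples = [
    {"country": "US", "age": 30, "bmi": 22.0,
     "abundance": {"Bacteroides": 40.0, "Prevotella": 10.0}},
    {"country": "US", "age": 35, "bmi": 24.0,
     "abundance": {"Bacteroides": 20.0, "Faecalibacterium": 30.0}},
    {"country": "JP", "age": 30, "bmi": 22.0,
     "abundance": {"Bifidobacterium": 50.0}},
]

ranking = rank_taxa(samples, "US", age_range=(25, 40), bmi_range=(18.5, 25.0))
for taxon, mean_abundance in ranking:
    print(taxon, mean_abundance)
```

The same function works for genus-level or species-level tables, since it only depends on the keys of each sample's abundance mapping.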

Snapshot of our dataset collected:

8. Conclusion

Overall, our software can predict bacterial relative abundance from environmental parameters better than random chance. However, there is still much to improve. Our project also serves as a call to action for the field, stressing the importance of collecting and providing metadata with 16S data. One of our challenges in building accurate models was the difficulty of finding 16S data with available metadata. We will continue working to make our software more accurate in the future.