Modeling


This page explains the technical details of our modeling techniques. We developed two major categories of models for our software: (1) prediction models that predict bacterial abundance under a given set of conditions, and (2) growth models that predict the metabolic activity of a particular bacterium. For the predictive models, we used our GIS dataset to predict the relative abundance of bacteria in a given condition. The inputs to these models are independent variables such as temperature, the amounts of certain elements, and pH; the output is the relative abundance of bacteria. An earlier version of our software contained four different regression models: linear regression, K-nearest neighbor (KNN), random forest, and an artificial neural network. Our final predictive software retains all of these except KNN and random forest. For the growth models, genome-scale metabolic models (GEMs) are used to model the metabolic activity of bacteria. This page covers the background, mathematical principles, implementation, and results of each of these models.

Dimensionality Reduction and Data Visualization



Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is an unsupervised dimensionality reduction method that reduces the complexity of data. PCA uses eigenvalue decomposition to compress and denoise data (Abdi 2010). It creates new vector variables (principal components), composed of linear combinations of the original inputs, that are orthogonal to each other (Abdi 2010). PCA is widely used in practice to find the primary axes of variance and to spot outliers. Its primary shortcoming is that each principal component is mathematically required to include every input variable: with 50 explanatory variables, each principal component has 50 coefficients. This tends to make the result less interpretable.

In our project we used PCA to reduce the dimensionality of our data. We decomposed our environmental parameters into linearly independent vectors, of which only three were needed to explain ninety-nine percent of the variance in our data. We examined the magnitudes of the coefficients associated with the parameters making up these vectors to learn which variables we should expect to be most important. In one of our PCAs, PC1 alone represented 60% of the variance in our data; in other words, this one principal component carries 60% of the explanatory variables' predictive ability. This component, a linear combination of all of the input variables, was dominated by the nitrogen and carbon content variables. Coupled with our regression results, this reaffirmed that nitrogen and carbon concentrations in the soil are important for predicting bacterial abundance. We initially used PCA to make predictions about a bacterium's presence in an environment and plotted the principal components for data visualization. However, we soon found that Linear Discriminant Analysis produced better results for both prediction and visualization.
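For illustration, a minimal scikit-learn sketch of this kind of PCA workflow is shown below. The column names and data are synthetic placeholders rather than our actual soil table; the point is how the explained variance ratios and component loadings are read off.

# Illustrative sketch: PCA on standardized (placeholder) soil parameters.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
cols = ["total_nitrogen", "organic_carbon", "ph", "temperature", "respiration"]
soil = pd.DataFrame(rng.normal(size=(100, len(cols))), columns=cols)  # placeholder data

X = StandardScaler().fit_transform(soil)          # PCA is scale-sensitive, so standardize first
pca = PCA(n_components=3)                         # keep three components, as discussed above
scores = pca.fit_transform(X)

print(pca.explained_variance_ratio_)              # variance captured by PC1-PC3

# Loadings: the coefficient of each parameter in each component, used to judge
# which variables (e.g., nitrogen and carbon content) dominate PC1
loadings = pd.DataFrame(pca.components_, columns=cols, index=["PC1", "PC2", "PC3"])
print(loadings.T.sort_values("PC1", key=abs, ascending=False))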

Linear Discriminant Analysis


Linear Discriminant Analysis (LDA) is also a dimensionality reduction technique. Unlike PCA, however, its primary purpose is to separate categorical variables: it creates components, referred to as linear discriminants, that maximize the separation between classes so that distinct classes can be visualized (Dash, 2022). In the context of our project, we used LDA to visualize the differences between the observed living conditions of specific bacterial genera. We got the following output:

In the images above, data from our most complete subset produced three distinct linear discriminants. From this analysis, we learned that even within this subset there were a few outliers, indicating potential error in the dataset. Additionally, it was difficult to distinguish between the genus categories, suggesting a large amount of overlap between bacterial growth environments. These results indicated that the range of conditions -2.5 < LD1 < 2.5 and -2.5 < LD2 < 2.5 was generally advantageous for bacterial growth of any kind.

Additionally, looking at the linear combinations, we saw very small coefficients associated with respiration squared and organic carbon squared, which indicates that they contribute very little to representing the variance between bacteria in this sample and would most likely not be relevant in the regression. However, in the regression we found that organic carbon and respiration played a large role in predicting relative abundance. So, while they are important for predicting abundance, they are not helpful for making distinctions between bacteria. After this initial LDA we used multiple different subsets and changed the way we classified bacteria to get our final visualization.

Originally, we looked at our relative abundance data and classified a bacterium as present if its relative abundance was above the median. For samples with multiple bacteria present, we duplicated the soil parameters so that each record was associated with a single genus. For example, if a sample originally had both E. coli and B. subtilis present, we entered the environmental conditions twice, once per species. In the LDA above, however, we classified samples by the group of genera present. We changed our approach because of the significant overlap in environmental conditions between different genera.

The above graph was created from 80% of our data, the training set; the remaining 20% was set aside as testing data. We tasked the LDA with predicting which of the unique genus categories each test sample belonged to. The LDA predicted the exact combination of bacterial genera present 45% of the time. This model counts a prediction as wrong if it does not identify every bacterial genus present, so in reality a large number of its predictions are likely partially correct.
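The train/test workflow behind this accuracy figure can be sketched roughly as follows with scikit-learn. The data and labels here are synthetic placeholders, and the label encoding (one category per genus combination) is an assumption based on the description above, not our exact script.

# Illustrative sketch: LDA as a classifier of genus-combination categories.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))                    # placeholder soil parameters
y = rng.integers(0, 4, size=300)                  # placeholder genus-combination categories

# 80/20 train/test split, as described above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# Project the training data onto the linear discriminants (LD1, LD2, ...) for plotting
ld_scores = lda.transform(X_train)

# Fraction of held-out samples whose exact category is predicted correctly
print("exact-match accuracy:", lda.score(X_test, y_test))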

The PCA and LDA successfully reduced our soil parameters to lower-dimensional variables. Additionally, they were instrumental in identifying and visualizing outliers. The LDA in particular was a strong predictor of bacterial presence, and with more time it would have been used to predict species-level data for input into our software. Due to time constraints, however, these two techniques were primarily used to corroborate the regression models' results and to visualize the differences in growth environments across genera.

Linear Regression and Data Analysis



Implementation

The very first step in our regression analysis was collecting all of the data we would use; after all, the regression is only as good as the data itself. To do this, we employed a multi-step process of sanitizing and ensuring consistency of data sampled from databases, individual studies, and fieldwork. We start by interpreting the raw data sources, which are computationally expensive to process due to extraneous or incomplete information and inconsistent or unknown units and scales.

To resolve these problems, we create metagenomic subsets based on environments, raw data sources, and the most abundant genera and species in the raw datasets. These metagenomic subsets are created using samples from raw data sources, where missing metadata is supplemented using GIS sources, smart defaulting strategies, and automatic unit conversion. Relative abundance information is then truncated and appended to subsets, providing output information corresponding to each sample.

In order to ensure the validity of our regression, we made the following four assumptions about our data.

  • Linearity: the explanatory variables have a linear relationship with the predicted variable. While a non-linear relationship is possible, this is not an unreasonable assumption. Note that this is an assumption about linearity in the coefficients, not the variables; it is very common for squared and cubic functions of explanatory variables to be included in a regression.
  • Homoscedasticity: the error is consistent across the range of the input variables (Yang 2019). Although our samples were taken from a broad range of studies, we assume that the measurement error is consistent across them.
  • Independence: the measurements are independent of each other (NIST Computer Security Resource Center, n.d.). We consider this reasonable given that our data come from numerous independent studies.
  • Normality: for any fixed value of X, the Y variable is normally distributed (Boston University School of Public Health, 2016). While this assumption is difficult to satisfy given the initial spread of our data, we were able to cultivate subsets with normally distributed output data and to use a logistic transformation to obtain a normal distribution.

Once these assumptions were met, we began our statistical analysis.
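For illustration, the normality and homoscedasticity assumptions can be checked with standard statistical tests. The sketch below uses statsmodels and SciPy on synthetic placeholder data; it is not our exact pipeline (we primarily addressed these assumptions through the subsetting and transformations described above).

# Illustrative sketch: checking residual normality and homoscedasticity for an OLS fit.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy.stats import shapiro

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                     # placeholder soil parameters
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.3, size=200)

exog = sm.add_constant(X)
model = sm.OLS(y, exog).fit()

# Normality of residuals (assumption 4): a small p-value suggests non-normal errors
print("Shapiro-Wilk p-value:", shapiro(model.resid).pvalue)

# Homoscedasticity (assumption 2): Breusch-Pagan test of residual variance vs. regressors
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, exog)
print("Breusch-Pagan p-value:", lm_pvalue)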


We used dummy, interactive, logarithmic, and high-degree polynomial variables in multiple iterations of regressions with multiple subsets of data. Dummy variables are binary variables (either 0 or 1) that represent a specific subgroup of a sample; in other words, they represent a vertical shift in prediction between the original group and the subgroup. For example, we used a dummy variable to represent winter. Fewer bacterial species are expected to survive in winter environments, so for those regressions we expect a larger negative coefficient on the winter dummy variable, shifting the original prediction downwards to the new expected abundance in winter. Cold-resistant bacteria should show a smaller decrease because they are less affected by the change in temperature. However, our regressions found that the coefficients on seasonal dummy variables were not statistically different from zero (based on their p-values). Bacterial survival therefore does not appear to be impacted by the seasons in our data, and seasonal dummies did not let us predict relative abundance with any certainty, so we eliminated them from our regression.


Interactive variables represent how a change in one variable can alter the effect of another. For example, for a bacterium that survives in a high-pH environment, recent rainfall might have a larger impact on abundance. With 64 variables it was difficult to isolate which interactions were important. In the last iteration of the regression, we wrote a Python script that built every possible combination of interactive variables (1040 interactions in total). Unfortunately, we were unable to use dimensionality reduction techniques to isolate only the statistically significant interactions. However, we were able to examine the interactions on a manual, case-by-case basis.
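The combinatorial step described above can be sketched as follows. Using scikit-learn's PolynomialFeatures is one way to build all pairwise interaction terms; the small placeholder predictor set stands in for our 64 variables, and this is an illustration rather than our exact script.

# Illustrative sketch: building all pairwise interaction terms between predictors.
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
cols = [f"x{i}" for i in range(1, 6)]             # placeholder predictors
X = pd.DataFrame(rng.normal(size=(100, len(cols))), columns=cols)

# degree=2 with interaction_only=True yields every product x_i * x_j (i < j)
# without squared terms, which enter the regression separately as polynomial variables
interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_int = pd.DataFrame(interactions.fit_transform(X),
                     columns=interactions.get_feature_names_out(cols))
print(X_int.shape)                                # original predictors plus all pairwise products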


Lastly, the high-degree polynomial and logarithmic variables make up most of our predictive and interpretable analysis. These account for the relationship between the variables X and Y, their rates of change with respect to each other, and how a percent change in an input parameter affects a percent change in the output Y. When only statistically significant variables are isolated, we obtain interpretable coefficients that can be used to understand how the soil parameters impact the regression. There are various ways to interpret a regression; here are a few techniques we used.

  • Suppose Y= b₀+b₁x₁+b₂x₂+...+ bₙxₙ is our regression. We can then find the derivative of both sides with respect to xₙ to find, dy/dxₙ= bₙ. This indicates a one unit change in parameter xₙ leads to a bₙ change in our Y prediction.
  • Suppose Y= b₀+b₁x₁+b₂x₁²+...+ bₙxₙ+bₙ₊₁xₙ² is our regression. Differentiating both sides with respect to xₙ gives dy/dxₙ = bₙ + 2bₙ₊₁xₙ. Setting this to zero and solving gives xₙ = -bₙ/(2bₙ₊₁). Depending on the original prediction, this shows that at xₙ = -bₙ/(2bₙ₊₁) the predictions change direction; in other words, past this point an increase in xₙ starts to produce a decrease in predictions.
  • Lastly, we have the log-linear, linear-log, and log-log forms. The equations and interpretations are as follows:
  • Log-linear, log(Y) = β₀ + βX: a one unit change in X leads to a 100·β percent change in Y.
  • Linear-log, Y = β₀ + β·log(X): a one percent change in X leads to a β/100 unit change in Y.
  • Log-log, log(Y) = β₀ + β·log(X): a one percent change in X leads to a β percent change in Y.

Linear Regression

Multivariate linear regression is one of the most popular statistical analysis methods, in which quantitative relationships between explanatory variables (soil parameters) and dependent variables (bacterial relative abundance) are established and analyzed. These relationships are described through coefficients that are calculated by minimizing the mean squared error through gradient descent and linear programming. The mathematical process is as follows:

Step 1: The algorithm sets up an equation relating a matrix of explanatory variables, X, to a matrix of dependent variables, Y, through the coefficient matrix 𝜷, with e as the error in a given prediction: Y = X𝜷 + e. The error e is the value we hope to minimize in order to optimize the accuracy of 𝜷.

  • Y= the matrix of dependent variables/bacterial abundance
  • X= any set of explanatory variables from X₁ to Xₙ that describe Y (we use X to denote the matrix of these predictors)/soil parameters
  • 𝜷= matrix of coefficients that are being optimized
  • e= error between prediction and measured value

Step 2: We define our mean squared error (MSE) as the sum of the squared individual error terms divided by the number of predictions: MSE = (1/n)·Σᵢ eᵢ². MSE is essentially a measure of how accurate our prediction currently is. We square the error so that negative and positive errors do not cancel out.

  • MSE= mean squared error
  • n= number of predictions (observations)
  • e= error between prediction and measured value

Step 3: In this step we rearrange the first equation to solve for e and substitute it into the second. Because X is a matrix of explanatory variables, the squared summation is computed as MSE = (1/n)·(Y − X𝜷)ᵀ(Y − X𝜷).

Step 4: The next step is to find 𝜷. For a single-variable regression, finding 𝜷 is as simple as taking the derivative and setting it to zero. For a multivariate regression, however, we employ gradient descent, a technique for finding the lowest point of a multidimensional error surface. Each step updates the coefficients as 𝜷new = 𝜷old − lr·J(𝜷).

  • 𝜷new= newly predicted 𝜷 after the time step
  • 𝜷old= initial prediction before the time step
  • lr= learning rate or step size of the function
  • J(𝜷)= the Jacobian of MSE
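A minimal NumPy sketch of this update loop, with synthetic placeholder data and an illustrative learning rate, is shown below; it demonstrates the gradient-descent procedure described in Steps 1-4, not our production code.

# Illustrative sketch: gradient descent on the MSE of a multivariate linear regression.
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # design matrix with intercept
true_beta = np.array([1.0, 0.5, -0.3, 0.2, 0.0])
Y = X @ true_beta + rng.normal(scale=0.1, size=n)

beta = np.zeros(p + 1)                                        # initial guess (beta_old)
lr = 0.05                                                     # learning rate / step size

for _ in range(2000):
    e = Y - X @ beta                                          # residuals
    grad = -2.0 / n * X.T @ e                                 # gradient of the MSE w.r.t. beta
    beta = beta - lr * grad                                   # beta_new = beta_old - lr * J(beta)

print(beta)                                                   # approaches the OLS solution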

This math is done almost instantaneously in R (we used RStudio). The code generally looks like this:


reg<- lm(Y~ x1 + x2 + x3 + ... + xn)
summary(reg)

This code outputs a table of coefficients, the standard error of each coefficient, and the statistical likelihood that each 𝜷 coefficient is not equal to zero, which indicates whether the corresponding explanatory variable is statistically relevant. Additionally, it outputs the distribution of the residuals, the distances between the observed values and the regression line. Lastly, it outputs R² and adjusted R² scores, which are common measurements of how accurately the regression describes the provided data; higher values suggest a better fit. With an R² of .55, for instance, the regression could be said to account for 55% of the variance in the data. Examples of these output tables are shown and explained in detail below.


Results

Over the course of our research, we obtained varying results in terms of which coefficients were statistically relevant, as well as their magnitudes and signs. The regression lets us draw statistically sound conclusions about the soil metadata and about bacterial survivability in different environments. For example, in the regression below, we considered the impact of several environmental factors on Pseudomonas relative abundance.

We made the following conclusions:

  • By measuring the 13 soil parameters included in the above regression, researchers can make a prediction of the relative abundance of a bacterial genus in that environment.
  • Secondly, we were able to draw conclusions about the impact of certain environmental factors on bacterial growth. For example, our log(total_nitrogen + .001) coefficient of .03717 indicates that a one percent increase in total nitrogen is expected to produce a .03717 percent increase in bacterial abundance. This conclusion follows from the log-log interpretation described above. Additionally, we determined that a 1 unit increase in soil density corresponds to a 4.4 percent increase in expected relative abundance.

These are just some of the many conclusions that researchers will be able to draw from these regression results. Undoubtedly, as recording and publishing uniformly organized metadata becomes more commonplace, these regressions will improve in accuracy. For now, we have built the framework for researchers to analyze the data that exists in an easily accessible format. Additionally, our software was built to be easily retrained, so if a researcher comes across a new set of soil data, they can input it into our model and receive the regression analysis.


Random Forest Regression (RFR)



Random Forest Regression is a common and effective supervised learning algorithm in data analysis (Ho 1995). Random forest belongs to the family of bagging algorithms. Bagging is a technique that combines predictions from multiple machine learning models to make more accurate predictions than any single model (Opitz 1999). The general idea of bagging is to train multiple weak models and package them together to form a strong model, so that the performance of the final model is much better than that of any single weak model. The weak models can be decision trees, SVMs, or other models.


In the training stage, random forest uses bootstrap sampling (random sampling with replacement) to draw multiple different sub-training datasets from the input training dataset, and uses these to train multiple different decision trees (a decision tree is a support tool that uses a tree-like model of decisions and their possible consequences). In the prediction stage, random forest averages the prediction results of its internal decision trees to get the final result. Our implementation of random forest regression (RFR) provides the following functions: model training, prediction on new data, and computation of feature importance.


Formulas

Random Forest Regression (RFR) is a combination of binary decision trees (CART, which is also the model implemented internally by Sklearn and Spark), and training an RFR means training multiple binary decision trees (Klusowski, 2020). In training a binary decision tree model, it is necessary to decide how to select the segmentation variables (features) and the segmentation points, and how to measure the quality of a given segmentation variable and segmentation point. For the selection of segmentation variables and points, this implementation uses the exhaustive method: it traverses each feature and every value of each feature to find the best segmentation variable and segmentation point. The quality of a segmentation variable and segmentation point is generally measured by the impurity of the nodes after the split, G(xᵢ, vᵢⱼ), the weighted sum of the impurity of each child node (Klusowski, 2020):

G(xᵢ, vᵢⱼ) = (n_left / N_s)·H(X_left) + (n_right / N_s)·H(X_right)

where n_left and n_right are the numbers of samples in the left and right child nodes, N_s is the number of samples at the parent node, and H(·) is the impurity of a node (for regression, typically the mean squared error of the samples in that node).

Implementation

We used all of our explanatory variables (predictors) as input and three dependent variables (bacterial abundance): Pseudomonas, Mycobacteria, and Bacillus. We split the predictors into two categories, continuous predictors (numeric values) and discrete predictors (categorical values), and we used Python with scikit-learn. We used a traditional data split, assigning 80% of the dataset as training data and 20% as validation data. Our error metrics for the validation set are mean absolute error, mean squared error, root mean squared error, and R squared.

A Python implementation of the KNN and Random Forest models was used to evaluate prediction-model accuracy.
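A minimal scikit-learn sketch of this setup is shown below. The synthetic data and hyperparameters (such as n_estimators) are placeholders, not our exact configuration; the sketch only illustrates the train/validation split and the error metrics we report.

# Illustrative sketch: random forest regression of relative abundance on soil predictors.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 13))                    # placeholder soil parameters
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.2, size=500)  # placeholder abundance

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
pred = rf.predict(X_test)

print("MAE :", mean_absolute_error(y_test, pred))
print("MSE :", mean_squared_error(y_test, pred))
print("RMSE:", mean_squared_error(y_test, pred) ** 0.5)
print("R2  :", r2_score(y_test, pred))
print("feature importances:", rf.feature_importances_)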


Results

Here are the validation results for the Random Forest Regression model. They measure the difference between predicted and real values when the validation set is fed into our trained model. The error metrics for our model are Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Mean Squared Error (MSE), and R squared (Willmott 2005).

Validation results for the Pseudomonas genus: MAE (Mean Absolute Error) = 0.024, RMSE (Root Mean Squared Error) = 0.071, MSE (Mean Squared Error) = 0.005, R-Squared = 0.243.

Validation results for the Mycobacteria genus: MAE = 0.001, RMSE= 0.0025, MSE = 6.35e-6, R-Squared = 0.376.

Validation results for the Bacillus genus: MAE = 0.006, RMSE= 0.023, MSE = 0.0005, R-Squared = 0.302.

For the Pseudomonas prediction, the MAE of 0.024 means the average absolute prediction error over all instances in the test set is 0.024; the RMSE of 0.071 is the standard deviation of the prediction errors; the MSE of 0.005 is the average of the squared errors; and the R-Squared of 0.243 means that 24.3% of the variance in Pseudomonas abundance is explained by the predictors (independent variables). For the Mycobacterium prediction, the MAE is 0.001, the RMSE is 0.0025, the MSE is 6.35e-6, and the R-Squared of 0.376 means that 37.6% of the variance in Mycobacterium abundance is explained by the predictors. For the Bacillus prediction, the MAE is 0.006, the RMSE is 0.023, the MSE is 0.0005, and the R-Squared of 0.302 means that 30.2% of the variance in Bacillus abundance is explained by the predictors.


K-Nearest Neighbor Regression (KNN)



The K-nearest neighbor algorithm measures the distance between different predictors for regression and classification (Cover 1967). The process works as follows: there is a sample dataset, the training set, and each record in the training set has a corresponding y-value. When new data without a y-value is input, each predictor of the new data is compared with the corresponding predictor of the records in the sample set, and the algorithm then estimates the y-value of the new point from the records within range (Cover 1967). That is, it looks at the closest k data points and takes the mean of their y-values.


Formulas

For K-nearest neighbor we have to calculate the distance between data points. To do this we use the Euclidean distance formula, d(x, x') = √(Σᵢ (xᵢ − x'ᵢ)²), and we predict a given point as the mean of the y-values of its k nearest neighbors, ŷ = (1/k)·Σⱼ yⱼ, where the sum runs over the k closest training points.

Implementation

We used all of the explanatory variables as input and the six dependent variables (bacterial reads). We used Python with scikit-learn, and a traditional data split assigning 80% of the dataset as training data and 20% as validation data. In a regression task, which predicts continuous values rather than labels, kNN takes the mean of the nearest k neighbors. The regressor is readily available via scikit-learn's prebuilt sklearn.neighbors.KNeighborsRegressor. For this regression we used scikit-learn's default k value of 5.
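A minimal sketch of that setup, again on synthetic placeholder data, might look like this:

# Illustrative sketch: k-nearest neighbor regression with scikit-learn's default k = 5.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 13))                    # placeholder soil parameters
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.2, size=500)  # placeholder abundance

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

knn = KNeighborsRegressor(n_neighbors=5)          # prediction = mean y of the 5 nearest samples
knn.fit(X_train, y_train)
pred = knn.predict(X_test)

print("MAE :", mean_absolute_error(y_test, pred))
print("RMSE:", mean_squared_error(y_test, pred) ** 0.5)
print("R2  :", r2_score(y_test, pred))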


Results

Validation results for the Pseudomonas genus: MAE = 0.024, RMSE = 0.073, MSE = 0.005, R-Squared = 0.192

Validation results for the Mycobacteria genus: MAE = 0.001, RMSE= 0.0026, MSE = 6.85e-6, R-Squared = 0.326

Validation results for the Bacillus genus: MAE = 0.009, RMSE= 0.027, MSE = 0.0007, R-Squared = 0.033

For the Pseudomonas prediction, the MAE of 0.024 means the average absolute prediction error over all instances in the test set is 0.024; the RMSE of 0.073 is the standard deviation of the prediction errors; the MSE of 0.005 is the average of the squared errors; and the R-Squared of 0.192 means that 19.2% of the variance in Pseudomonas abundance is explained by the predictors (independent variables). For the Mycobacterium prediction, the MAE is 0.001, the RMSE is 0.0026, the MSE is 6.85e-6, and the R-Squared of 0.326 means that 32.6% of the variance in Mycobacterium abundance is explained by the predictors. For the Bacillus prediction, the MAE is 0.009, the RMSE is 0.027, the MSE is 0.0007, and the R-Squared of 0.033 means that only 3.3% of the variance in Bacillus abundance is explained by the predictors.


Neural Networks



Neural networks, also known as artificial neural networks (ANNs) or simulated neural networks (SNNs), are a subset of machine learning and are at the heart of deep learning algorithms (LeCun et al., 2015). Their name and structure are inspired by the human brain, mimicking the way biological neurons send signals to one another. An artificial neural network consists of node layers: an input layer, one or more hidden layers, and an output layer. Each node, also called an artificial neuron, is connected to other nodes with associated weights and thresholds. If the output of an individual node is above the specified threshold, that node is activated and sends its data to the next layer of the network; otherwise, the data is not passed on.


Formulas

We used a binary cross-entropy error function:

L = −(1/N)·Σᵢ [yᵢ·log(p(yᵢ)) + (1 − yᵢ)·log(1 − p(yᵢ))]

where p(yᵢ) is the predicted relative abundance for point i and the sum runs over all N points.

Adam optimizer update (Kingma & Ba, 2015):

wₜ₊₁ = wₜ − α·(∂L/∂wₜ) / (√vₜ + ϵ)

  • wₜ₊₁ = the updated weight after the time step
  • wₜ = the weight at time t
  • vₜ = an exponentially decaying sum of the squares of past gradients
  • ∂L/∂wₜ = derivative of the loss function with respect to the weight at time t
  • ϵ = a small positive constant to avoid division by zero when vₜ → 0
  • β₁ & β₂ = decay rates of the averages of the gradients and squared gradients
  • α = step size parameter / learning rate

Implementation

We used Python and TensorFlow to build a network of 7 dense layers with one dropout layer between every two dense layers. We used binary cross-entropy as our loss function and Adam as our optimizer.
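The layer widths and dropout rates in the sketch below are illustrative assumptions; it shows the general architecture described above (dense layers interleaved with dropout, binary cross-entropy loss, Adam optimizer) on synthetic placeholder data, not our trained model.

# Illustrative sketch: a dense network with interleaved dropout, trained with Adam
# on a binary cross-entropy loss.
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 13)).astype("float32")         # placeholder soil parameters
y = rng.integers(0, 2, size=(1000, 1)).astype("float32")  # placeholder presence labels

model = tf.keras.Sequential([
    tf.keras.Input(shape=(13,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation="sigmoid"),        # 7 dense layers in total
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=["accuracy"])
model.fit(X, y, validation_split=0.2, epochs=10, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))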


Results

Using our 1,561 validation samples to test the accuracy of our model, we reached a final prediction accuracy of 71%.


GEM Models



Out of our three predictive tools, our genome-scale metabolic models are the only ones not based on our data collection. These models are stoichiometric matrices that map all of the possible metabolites present in an environment to the set of all reactions occurring in a cell (Orth 2010). We use Flux Balance Analysis (FBA) to predict unique growth rates of bacterial species given controllable media conditions. In order to run FBA, we make steady-state and optimality assumptions. Steady state assumes that the metabolite concentrations are constant, which allows us to set the input flux of each metabolite equal to its output flux (Orth 2010). Optimality assumes that the primary objective of the bacterium is to produce biomass (Orth 2010). We know that bacteria can enter growth-arrested states where this is not the case, but for predicting growth we deemed this assumption acceptable.
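Formally, these two assumptions define the linear program that FBA solves (Orth 2010), where S is the stoichiometric matrix, v is the vector of reaction fluxes, and c is a vector selecting the biomass reaction:

maximize Z = cᵀ·v (the biomass objective)
subject to S·v = 0 (steady state: no net accumulation of internal metabolites)
and lbᵢ ≤ vᵢ ≤ ubᵢ (flux bounds, including media uptake limits)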


Flux Balance Analysis is our primary application of GEMs. It outputs a rate of bacterial growth that is used as a prediction of the survivability of that chassis in the given conditions. This technique uses a core biomass function as its objective function: it groups all of the reactions necessary for biomass production into one matrix and then, using linear optimization, maximizes the rate at which biomass is produced (Feist 2010). Using the COBRA toolbox (Cobra Core Team, 2019), we can switch the GEM being optimized (changing between species) and set the uptake rates and bounds of certain metabolites (simulating different environmental conditions). This allows us to simulate how a cell's metabolism will behave in suboptimal conditions. The design process for FBA is as follows:



In our predictive model we took users' desired input concentrations and converted them to uptake rates that were fed into our FBA. We made two core assumptions. First, all uptake rates are described by the Michaelis-Menten kinetics equation, which relates the uptake of a metabolite to its concentration outside the cell; we used crude estimates of Vmax and the Michaelis constant for each metabolite. Second, organic carbon content is directly proportional to glucose levels. With these assumptions in place, we took all concentrations input by the user, converted them to fluxes, and fed them into the FBA. We observed that organic carbon and oxygen concentrations were the most important parameters in determining the cell's growth rate, as most other nutrients could be derived from other sources. All metabolite concentrations not defined by our model were left at the default values set by the given GEM.
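A rough COBRApy sketch of this pipeline is shown below. The bundled "textbook" E. coli core model, the Michaelis-Menten constants, and the glucose concentration are placeholders rather than the exact GEMs and estimates we used.

# Illustrative sketch: converting a user-supplied glucose concentration to an uptake
# flux via Michaelis-Menten kinetics, then running FBA with COBRApy.
from cobra.io import load_model

def michaelis_menten(concentration, v_max, k_m):
    """Uptake rate as a function of external metabolite concentration."""
    return v_max * concentration / (k_m + concentration)

model = load_model("textbook")                    # bundled E. coli core model; any GEM works

# Placeholder kinetic constants and user input (mM and mmol/gDW/h are assumed units)
glucose_conc = 5.0
uptake_flux = michaelis_menten(glucose_conc, v_max=10.0, k_m=0.5)

# Constrain the glucose exchange reaction to the computed uptake rate
# (a negative lower bound means uptake in the COBRA convention)
glc_exchange = model.reactions.get_by_id("EX_glc__D_e")
glc_exchange.lower_bound = -uptake_flux

solution = model.optimize()                       # maximize the biomass objective (FBA)
print("predicted growth rate:", solution.objective_value)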


However, the primary results of our GEM work came from our collaboration with Gaston Day School (GDS). We were asked to model their E. coli circuit as well as whether L-Phenylalanine was a limiting factor in the production of cinnamaldehyde. We also wanted to provide them with interactive software to simulate these limiting factors under various media conditions. We chose GEMs as our model for GDS because we could add their engineered reactions and use Flux Balance Analysis (FBA) to predict growth under limited media conditions. The primary output was a Google Colaboratory notebook that allows GDS students to change media conditions, add and delete reactions, and test growth rates of any bacterium with a GEM. See the general code on our Results page.


References



  • Abdi, H., & Williams, L. J. (2010). Principal component analysis. Wiley interdisciplinary reviews: computational statistics, 2(4), 433-459.
  • Boston University School of Public Health. (2016). Simple Linear Regression. Simple linear regression. Retrieved October 11, 2022, from https://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/R/R5_Correlation-Regression/R5_Correlation-Regression4.html
  • Cobrapy Core Team. (2019). Documentation for COBRApy. Documentation for COBRApy - cobra 0.25.0 documentation. Retrieved October 11, 2022, from https://cobrapy.readthedocs.io/en/latest/
  • Cover, T.M., Hart, P.E. Nearest neighbor pattern classification. IEEE Trans. Inform. Theory, IT-13(1):21–27, 1967
  • Dash, S. K. (2022, August 5). Linear discriminant analysis: What is linear discriminant analysis. Analytics Vidhya. Retrieved October 11, 2022, from https://www.analyticsvidhya.com/blog/2021/08/a-brief-introduction-to-linear-discriminant-analysis/
  • Feist, A. M., & Palsson, B. O. (2010). The biomass objective function. Current opinion in microbiology, 13(3), 344-349.
  • Ho, Tin Kam (1995). Random Decision Forests (PDF). Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, 14–16 August 1995. pp. 278–282.
  • Kingma, diederik, & Ba, Jimmy (2015). Adam: A Method for Stochastic Optimization. 3rd International Conference for Learning Representations, San Diego, 2015.
  • Klusowski, J. M. (2020, August 13). Analyzing cart. arXiv.org. Retrieved October 11, 2022, from https://arxiv.org/abs/1906.10086
  • LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436–444.
  • Menon, A. (2018, September 19). Linear regression using gradient descent. Medium. Retrieved October 11, 2022, from https://towardsdatascience.com/linear-regression-using-gradient-descent-97a6c8700931
  • NIST Computer Security Resource Center (n.d.). Statistically independent events - glossary: CSRC. CSRC Content Editor. Retrieved October 11, 2022, from https://csrc.nist.gov/glossary/term/statistically_independent_events#:~:text=Two%20events%20are%20independent%20if,%2C%20P(A%20and%20B)
  • Opitz, D.; Maclin, R. (1999). "Popular ensemble methods: An empirical study". Journal of Artificial Intelligence Research. 11: 169–198. doi:10.1613/jair.614
  • Orth, J. D., Thiele, I., & Palsson, B. Ø. (2010). What is Flux Balance Analysis? Nature Biotechnology, 28(3), 245–248. https://doi.org/10.1038/nbt.1614
  • PSU. (2018). 5.4 - a matrix formulation of the multiple regression model. 5.4 - A Matrix Formulation of the Multiple Regression Model | STAT 462. Retrieved October 11, 2022, from https://online.stat.psu.edu/stat462/node/132/
  • Willmott, Cort J.; Matsuura, Kenji (December 19, 2005). "Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance". Climate Research. 30: 79–82.
  • Yang, K., Tu, J., & Chen, T. (2019). Homoscedasticity: An overlooked critical assumption for linear regression. General psychiatry, 32(5).
  • Zhang, Z., & Sabuncu, M. (2018). Generalized cross entropy loss for training deep neural networks with noisy labels. Advances in neural information processing systems, 31.