Modeling


This page explains the technical details of our modeling techniques. We developed two major categories of models for our software: (1) prediction models that predict bacterial abundance under a given set of conditions, and (2) growth models that predict the metabolic activity of a particular bacterium. For the predictive models, we used our GIS dataset to predict the relative abundance of bacteria in a given condition. The inputs to these models are independent variables such as temperature, the amounts of certain elements, and pH; the output is the relative abundance of bacteria. An earlier version of our software contained four different regression models: linear regression, K-nearest neighbor (KNN), random forest, and an artificial neural network. Our final predictive software retains all of these except KNN and random forest. For the growth models, genome-scale metabolic models (GEMs) are used to model the metabolic activity of bacteria. This page covers the background, mathematical principles, implementation, and results of each of these models.

Dimensionality Reduction and Data Visualization



Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is an unsupervised dimensionality reduction method that reduces the complexity of data. PCA uses eigenvalue decomposition to compress and denoise data (Abdi 2010). It creates new vector variables (principal components), composed of linear combinations of the original inputs, that are orthogonal to each other (Abdi 2010). PCA is widely used in practice to find the primary axes of variance and to spot outliers. Its primary shortcoming is that each principal component is mathematically required to include every input variable: with 50 explanatory variables, each principal component has 50 coefficients. This tends to make the result less interpretable.

In our project we used PCA to reduce the dimensionality of our data. We decomposed our environmental parameters into linearly independent vectors, of which only three were needed to explain ninety-nine percent of the variance in our data. We examined the magnitudes of the coefficients associated with the parameters making up these vectors to learn which variables we should expect to be most important. In one of our PCAs, PC1 alone represented 60% of the variance in our data; in other words, this one principal component carries 60% of the explanatory variables' predictive ability. This component, a linear combination of all of the input variables, was dominated by the nitrogen and carbon content variables. Coupled with our regression results, this reaffirmed that nitrogen and carbon concentrations in the soil are important for predicting bacterial abundance. We initially used PCA to make predictions about a bacterium's presence in an environment and plotted the principal components for data visualization. However, we soon found that Linear Discriminant Analysis produced better results for both prediction and visualization.
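For illustration, a minimal scikit-learn sketch of this kind of PCA workflow is shown below. The column names and data are synthetic placeholders rather than our actual soil table; the point is how the explained variance ratios and component loadings are read off.

# Illustrative sketch: PCA on standardized (placeholder) soil parameters.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
cols = ["total_nitrogen", "organic_carbon", "ph", "temperature", "respiration"]
soil = pd.DataFrame(rng.normal(size=(100, len(cols))), columns=cols)  # placeholder data

X = StandardScaler().fit_transform(soil)          # PCA is scale-sensitive, so standardize first
pca = PCA(n_components=3)                         # keep three components, as discussed above
scores = pca.fit_transform(X)

print(pca.explained_variance_ratio_)              # variance captured by PC1-PC3

# Loadings: the coefficient of each parameter in each component, used to judge
# which variables (e.g., nitrogen and carbon content) dominate PC1
loadings = pd.DataFrame(pca.components_, columns=cols, index=["PC1", "PC2", "PC3"])
print(loadings.T.sort_values("PC1", key=abs, ascending=False))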

Linear Discriminant Analysis


Linear Discriminant Analysis (LDA) is also a dimensionality reduction technique. Unlike PCA, however, its primary purpose is to separate categorical variables: it creates components, referred to as linear discriminants, that maximize the separation between classes so that distinct classes can be visualized (Dash, 2022). In the context of our project, we used LDA to visualize the differences between the observed living conditions of specific bacterial genera. We got the following output:

In the images above, data from our most complete subset produced three distinct linear discriminants. From this analysis, we learned that even within this subset there were a few outliers, indicating potential error in the dataset. Additionally, it was difficult to distinguish between the genus categories, suggesting a large amount of overlap between bacterial growth environments. These results indicated that the range of conditions -2.5 < LD1 < 2.5 and -2.5 < LD2 < 2.5 was generally advantageous for bacterial growth of any kind.

Additionally, looking at the linear combinations, we saw very small coefficients associated with respiration squared and organic carbon squared, which indicates that they contribute very little to representing the variance between bacteria in this sample and would most likely not be relevant in the regression. However, in the regression we found that organic carbon and respiration played a large role in predicting relative abundance. So, while they are important for predicting abundance, they are not helpful for making distinctions between bacteria. After this initial LDA we used multiple different subsets and changed the way we classified bacteria to get our final visualization.

Originally, we looked at our relative abundance data and classified a bacterium as present if its relative abundance was above the median. For samples with multiple bacteria present, we duplicated the soil parameters so that each record was associated with a single genus. For example, if a sample originally had both E. coli and B. subtilis present, we entered the environmental conditions twice, once per species. In the LDA above, however, we classified samples by the group of genera present. We changed our approach because of the significant overlap in environmental conditions between different genera.

The above graph was created from 80% of our data, the training set; the remaining 20% was set aside as testing data. We tasked the LDA with predicting which of the unique genus categories each test sample belonged to. The LDA predicted the exact combination of bacterial genera present 45% of the time. This model counts a prediction as wrong if it does not identify every bacterial genus present, so in reality a large number of its predictions are likely partially correct.
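The train/test workflow behind this accuracy figure can be sketched roughly as follows with scikit-learn. The data and labels here are synthetic placeholders, and the label encoding (one category per genus combination) is an assumption based on the description above, not our exact script.

# Illustrative sketch: LDA as a classifier of genus-combination categories.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))                    # placeholder soil parameters
y = rng.integers(0, 4, size=300)                  # placeholder genus-combination categories

# 80/20 train/test split, as described above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# Project the training data onto the linear discriminants (LD1, LD2, ...) for plotting
ld_scores = lda.transform(X_train)

# Fraction of held-out samples whose exact category is predicted correctly
print("exact-match accuracy:", lda.score(X_test, y_test))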

The PCA and LDA successfully reduced our soil parameters to lower-dimensional variables. Additionally, they were instrumental in identifying and visualizing outliers. The LDA in particular was a strong predictor of bacterial presence, and with more time it would have been used to predict species-level data for input into our software. Due to time constraints, however, these two techniques were primarily used to corroborate the regression models' results and to visualize the differences in growth environments across genera.

Linear Regression and Data Analysis



Implementation

The very first step in our regression analysis was collecting all of the data we would use; after all, the regression is only as good as the data itself. To do this, we employed a multi-step process of sanitizing and ensuring consistency of data sampled from databases, individual studies, and fieldwork. We start by interpreting the raw data sources, which are computationally expensive to process due to extraneous or incomplete information and inconsistent or unknown units and scales.

To resolve these problems, we create metagenomic subsets based on environments, raw data sources, and the most abundant genera and species in the raw datasets. These metagenomic subsets are created using samples from raw data sources, where missing metadata is supplemented using GIS sources, smart defaulting strategies, and automatic unit conversion. Relative abundance information is then truncated and appended to subsets, providing output information corresponding to each sample.

In order to ensure the validity of our regression, we made the following four assumptions about our data.

  • Linearity: the explanatory variables have a linear relationship with the predicted variable. While a non-linear relationship is possible, this is not an unreasonable assumption. Note that this is an assumption about linearity in the coefficients, not the variables; it is very common for squared and cubic functions of explanatory variables to be included in a regression.
  • Homoscedasticity: the error is consistent across the range of the input variables (Yang 2019). Although our samples were taken from a broad range of studies, we assume that the measurement error is consistent across them.
  • Independence: the measurements are independent of each other (NIST Computer Security Resource Center, n.d.). We consider this reasonable given that our data come from numerous independent studies.
  • Normality: for any fixed value of X, the Y variable is normally distributed (Boston University School of Public Health, 2016). While this assumption is difficult to satisfy given the initial spread of our data, we were able to cultivate subsets with normally distributed output data and to use a logistic transformation to obtain a normal distribution.

Once these assumptions were met, we began our statistical analysis.
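For illustration, the normality and homoscedasticity assumptions can be checked with standard statistical tests. The sketch below uses statsmodels and SciPy on synthetic placeholder data; it is not our exact pipeline (we primarily addressed these assumptions through the subsetting and transformations described above).

# Illustrative sketch: checking residual normality and homoscedasticity for an OLS fit.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy.stats import shapiro

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                     # placeholder soil parameters
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.3, size=200)

exog = sm.add_constant(X)
model = sm.OLS(y, exog).fit()

# Normality of residuals (assumption 4): a small p-value suggests non-normal errors
print("Shapiro-Wilk p-value:", shapiro(model.resid).pvalue)

# Homoscedasticity (assumption 2): Breusch-Pagan test of residual variance vs. regressors
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, exog)
print("Breusch-Pagan p-value:", lm_pvalue)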


We used dummy, interactive, logarithmic, and high-degree polynomial variables in multiple iterations of regressions with multiple subsets of data. Dummy variables are binary variables (either 0 or 1) that represent a specific subgroup of a sample; in other words, they represent a vertical shift in prediction between the original group and the subgroup. For example, we used a dummy variable to represent winter. Fewer bacterial species are expected to survive in winter environments, so for those regressions we expect a larger negative coefficient on the winter dummy variable, shifting the original prediction downwards to the new expected abundance in winter. Cold-resistant bacteria should show a smaller decrease because they are less affected by the change in temperature. However, our regressions found that the coefficients on seasonal dummy variables were not statistically different from zero (based on their p-values). Bacterial survival therefore does not appear to be impacted by the seasons in our data, and seasonal dummies did not let us predict relative abundance with any certainty, so we eliminated them from our regression.


Interactive variables represent how a change in one variable can alter the effect of another. For example, for a bacterium that survives in a high-pH environment, recent rainfall might have a larger impact on abundance. With 64 variables it was difficult to isolate which interactions were important. In the last iteration of the regression, we wrote a Python script that built every possible combination of interactive variables (1040 interactions in total). Unfortunately, we were unable to use dimensionality reduction techniques to isolate only the statistically significant interactions. However, we were able to examine the interactions on a manual, case-by-case basis.
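The combinatorial step described above can be sketched as follows. Using scikit-learn's PolynomialFeatures is one way to build all pairwise interaction terms; the small placeholder predictor set stands in for our 64 variables, and this is an illustration rather than our exact script.

# Illustrative sketch: building all pairwise interaction terms between predictors.
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
cols = [f"x{i}" for i in range(1, 6)]             # placeholder predictors
X = pd.DataFrame(rng.normal(size=(100, len(cols))), columns=cols)

# degree=2 with interaction_only=True yields every product x_i * x_j (i < j)
# without squared terms, which enter the regression separately as polynomial variables
interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_int = pd.DataFrame(interactions.fit_transform(X),
                     columns=interactions.get_feature_names_out(cols))
print(X_int.shape)                                # original predictors plus all pairwise products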


Lastly, the high-degree polynomial and logarithmic variables make up most of our predictive and interpretable analysis. These account for the relationship between the variables X and Y, their rates of change with respect to each other, and how a percent change in an input parameter affects a percent change in the output Y. When only statistically significant variables are isolated, we obtain interpretable coefficients that can be used to understand how the soil parameters impact the regression. There are various ways to interpret a regression; here are a few techniques we used.

  • Suppose Y= b₀+b₁x₁+b₂x₂+...+ bₙxₙ is our regression. We can then find the derivative of both sides with respect to xₙ to find, dy/dxₙ= bₙ. This indicates a one unit change in parameter xₙ leads to a bₙ change in our Y prediction.
  • Suppose Y= b₀+b₁x₁+b₂x₁²+...+ bₙxₙ+bₙ₊₁xₙ² is our regression. Differentiating both sides with respect to xₙ gives dy/dxₙ = bₙ + 2bₙ₊₁xₙ. Setting this to zero and solving gives xₙ = -bₙ/(2bₙ₊₁). Depending on the original prediction, this shows that at xₙ = -bₙ/(2bₙ₊₁) the predictions change direction; in other words, past this point an increase in xₙ starts to produce a decrease in predictions.
  • Lastly, we have the log-linear, linear-log, and log-log forms. The equations and interpretations are as follows:
  • Log-linear, log(Y) = β₀ + βX: a one unit change in X leads to a 100·β percent change in Y.
  • Linear-log, Y = β₀ + β·log(X): a one percent change in X leads to a β/100 unit change in Y.
  • Log-log, log(Y) = β₀ + β·log(X): a one percent change in X leads to a β percent change in Y.

Linear Regression

Multivariate linear regression is one of the most popular statistical analysis methods, in which quantitative relationships between explanatory variables (soil parameters) and dependent variables (bacterial relative abundance) are established and analyzed. These relationships are described through coefficients that are calculated by minimizing the mean squared error through gradient descent and linear programming. The mathematical process is as follows:

Step 1: The algorithm sets up an equation relating a matrix of explanatory variables, X, to a matrix of dependent variables, Y, through the coefficient matrix 𝜷, with e as the error in a given prediction: Y = X𝜷 + e. The error e is the value we hope to minimize in order to optimize the accuracy of 𝜷.

  • Y= the matrix of dependent variables/bacterial abundance
  • X= any set of explanatory variables from X₁ to Xₙ that describe Y (we use X to denote the matrix of these predictors)/soil parameters
  • 𝜷= matrix of coefficients that are being optimized
  • e= error between prediction and measured value

Step 2: We define our mean squared error (MSE) as the sum of the squared individual error terms divided by the number of predictions: MSE = (1/n)·Σᵢ eᵢ². MSE is essentially a measure of how accurate our prediction currently is. We square the error so that negative and positive errors do not cancel out.

  • MSE= mean squared error
  • n= number of predictions (observations)
  • e= error between prediction and measured value

Step 3: In this step we rearrange the first equation to solve for e and substitute it into the second. Because X is a matrix of explanatory variables, the squared summation is computed as MSE = (1/n)·(Y − X𝜷)ᵀ(Y − X𝜷).

Step 4: The next step is to find 𝜷. For a single-variable regression, finding 𝜷 is as simple as taking the derivative and setting it to zero. For a multivariate regression, however, we employ gradient descent, a technique for finding the lowest point of a multidimensional error surface. Each step updates the coefficients as 𝜷new = 𝜷old − lr·J(𝜷).

  • 𝜷new= newly predicted 𝜷 after the time step
  • 𝜷old= initial prediction before the time step
  • lr= learning rate or step size of the function
  • J(𝜷)= the Jacobian of MSE
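A minimal NumPy sketch of this update loop, with synthetic placeholder data and an illustrative learning rate, is shown below; it demonstrates the gradient-descent procedure described in Steps 1-4, not our production code.

# Illustrative sketch: gradient descent on the MSE of a multivariate linear regression.
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # design matrix with intercept
true_beta = np.array([1.0, 0.5, -0.3, 0.2, 0.0])
Y = X @ true_beta + rng.normal(scale=0.1, size=n)

beta = np.zeros(p + 1)                                        # initial guess (beta_old)
lr = 0.05                                                     # learning rate / step size

for _ in range(2000):
    e = Y - X @ beta                                          # residuals
    grad = -2.0 / n * X.T @ e                                 # gradient of the MSE w.r.t. beta
    beta = beta - lr * grad                                   # beta_new = beta_old - lr * J(beta)

print(beta)                                                   # approaches the OLS solution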

This math is done almost instantaneously in R (we used RStudio). The code generally looks like this:


reg<- lm(Y~ x1 + x2 + x3 + ... + xn)
summary(reg)

This code outputs a table of coefficients, the standard error of each coefficient, and the statistical likelihood that each 𝜷 coefficient is not equal to zero, which indicates whether the corresponding explanatory variable is statistically relevant. Additionally, it outputs the distribution of the residuals, the distances between the observed values and the regression line. Lastly, it outputs R² and adjusted R² scores, which are common measurements of how accurately the regression describes the provided data; higher values suggest a better fit. With an R² of .55, for instance, the regression could be said to account for 55% of the variance in the data. Examples of these output tables are shown and explained in detail below.


Results

Over the course of our research, we obtained varying results in terms of which coefficients were statistically relevant, as well as their magnitudes and signs. The regression lets us draw statistically sound conclusions about the soil metadata and about bacterial survivability in different environments. For example, in the regression below, we considered the impact of several environmental factors on Pseudomonas relative abundance.

We made the following conclusions:

  • By measuring the 13 soil parameters included in the above regression, researchers can make a prediction of the relative abundance of a bacterial genus in that environment.
  • Secondly, we were able to draw conclusions about the impact of certain environmental factors on bacterial growth. For example, our log(total_nitrogen + .001) coefficient of .03717 indicates that a one percent increase in total nitrogen is expected to produce a .03717 percent increase in bacterial abundance. This conclusion follows from the log-log interpretation described above. Additionally, we determined that a 1 unit increase in soil density corresponds to a 4.4 percent increase in expected relative abundance.

These are just some of the many conclusions that researchers will be able to draw from these regression results. Undoubtedly, as recording and publishing uniformly organized metadata becomes more commonplace, these regressions will improve in accuracy. For now, we have built the framework for researchers to analyze the data that exists in an easily accessible format. Additionally, our software was built to be easily retrained, so if a researcher comes across a new set of soil data, they can input it into our model and receive the regression analysis.


Random Forest Regression (RFR)



Random Forest Regression is a common and effective supervised learning algorithm in data analysis (Ho 1995). Random forest belongs to the family of bagging algorithms. Bagging is a technique that combines predictions from multiple machine learning models to make more accurate predictions than any single model (Opitz 1999). The general idea of bagging is to train multiple weak models and package them together to form a strong model, so that the performance of the final model is much better than that of any single weak model. The weak models can be decision trees, SVMs, or other models.


In the training stage, random forest uses bootstrap sampling (random sampling with replacement) to draw multiple different sub-training datasets from the input training dataset, and uses these to train multiple different decision trees (a decision tree is a support tool that uses a tree-like model of decisions and their possible consequences). In the prediction stage, random forest averages the prediction results of its internal decision trees to get the final result. Our implementation of random forest regression (RFR) provides the following functions: model training, prediction on new data, and computation of feature importance.


Formulas

Random Forest Regression (RFR) is a combination of binary decision trees (CART, which is also the model implemented internally by Sklearn and Spark), and training an RFR means training multiple binary decision trees (Klusowski, 2020). In training a binary decision tree model, it is necessary to decide how to select the segmentation variables (features) and the segmentation points, and how to measure the quality of a given segmentation variable and segmentation point. For the selection of segmentation variables and points, this implementation uses the exhaustive method: it traverses each feature and every value of each feature to find the best segmentation variable and segmentation point. The quality of a segmentation variable and segmentation point is generally measured by the impurity of the nodes after the split, G(xᵢ, vᵢⱼ), the weighted sum of the impurity of each child node (Klusowski, 2020):

G(xᵢ, vᵢⱼ) = (n_left / N_s)·H(X_left) + (n_right / N_s)·H(X_right)

where n_left and n_right are the numbers of samples in the left and right child nodes, N_s is the number of samples at the parent node, and H(·) is the impurity of a node (for regression, typically the mean squared error of the samples in that node).

Implementation

We used all of our explanatory variables (predictors) as input and three dependent variables (bacterial abundance): Pseudomonas, Mycobacteria, and Bacillus. We split the predictors into two categories, continuous predictors (numeric values) and discrete predictors (categorical values), and we used Python with scikit-learn. We used a traditional data split, assigning 80% of the dataset as training data and 20% as validation data. Our error metrics for the validation set are mean absolute error, mean squared error, root mean squared error, and R squared.

A Python implementation of the KNN and Random Forest models was used to evaluate prediction-model accuracy.
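A minimal scikit-learn sketch of this setup is shown below. The synthetic data and hyperparameters (such as n_estimators) are placeholders, not our exact configuration; the sketch only illustrates the train/validation split and the error metrics we report.

# Illustrative sketch: random forest regression of relative abundance on soil predictors.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 13))                    # placeholder soil parameters
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.2, size=500)  # placeholder abundance

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
pred = rf.predict(X_test)

print("MAE :", mean_absolute_error(y_test, pred))
print("MSE :", mean_squared_error(y_test, pred))
print("RMSE:", mean_squared_error(y_test, pred) ** 0.5)
print("R2  :", r2_score(y_test, pred))
print("feature importances:", rf.feature_importances_)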


Results

Here are the validation results for the Random Forest Regression model. They measure the difference between predicted and real values when the validation set is fed into our trained model. The error metrics for our model are Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Mean Squared Error (MSE), and R squared (Willmott 2005).

Validation results for the Pseudomonas genus: MAE (Mean Absolute Error) = 0.024, RMSE (Root Mean Squared Error) = 0.071, MSE (Mean Squared Error) = 0.005, R-Squared = 0.243.

Validation results for the Mycobacteria genus: MAE = 0.001, RMSE= 0.0025, MSE = 6.35e-6, R-Squared = 0.376.

Validation results for the Bacillus genus: MAE = 0.006, RMSE= 0.023, MSE = 0.0005, R-Squared = 0.302.

For the Pseudomonas prediction, the MAE of 0.024 means the average absolute prediction error over all instances in the test set is 0.024; the RMSE of 0.071 is the standard deviation of the prediction errors; the MSE of 0.005 is the average of the squared errors; and the R-Squared of 0.243 means that 24.3% of the variance in Pseudomonas abundance is explained by the predictors (independent variables). For the Mycobacterium prediction, the MAE is 0.001, the RMSE is 0.0025, the MSE is 6.35e-6, and the R-Squared of 0.376 means that 37.6% of the variance in Mycobacterium abundance is explained by the predictors. For the Bacillus prediction, the MAE is 0.006, the RMSE is 0.023, the MSE is 0.0005, and the R-Squared of 0.302 means that 30.2% of the variance in Bacillus abundance is explained by the predictors.


K-Nearest Neighbor Regression (KNN)



The K-nearest neighbor algorithm measures the distance between different predictors for regression and classification (Cover 1967). The process works as follows: there is a sample dataset, the training set, and each record in the training set has a corresponding y-value. When new data without a y-value is input, each predictor of the new data is compared with the corresponding predictor of the records in the sample set, and the algorithm then estimates the y-value of the new point from the records within range (Cover 1967). That is, it looks at the closest k data points and takes the mean of their y-values.


Formulas

For K-nearest neighbor we have to calculate the distance between data points. To do this we use the Euclidean distance formula, d(x, x') = √(Σᵢ (xᵢ − x'ᵢ)²), and we predict a given point as the mean of the y-values of its k nearest neighbors, ŷ = (1/k)·Σⱼ yⱼ, where the sum runs over the k closest training points.

Implementation

We used all of the explanatory variables as input and the six dependent variables (bacterial reads). We used Python with scikit-learn, and a traditional data split assigning 80% of the dataset as training data and 20% as validation data. In a regression task, which predicts continuous values rather than labels, kNN takes the mean of the nearest k neighbors. The regressor is readily available via scikit-learn's prebuilt sklearn.neighbors.KNeighborsRegressor. For this regression we used scikit-learn's default k value of 5.
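A minimal sketch of that setup, again on synthetic placeholder data, might look like this:

# Illustrative sketch: k-nearest neighbor regression with scikit-learn's default k = 5.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 13))                    # placeholder soil parameters
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.2, size=500)  # placeholder abundance

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

knn = KNeighborsRegressor(n_neighbors=5)          # prediction = mean y of the 5 nearest samples
knn.fit(X_train, y_train)
pred = knn.predict(X_test)

print("MAE :", mean_absolute_error(y_test, pred))
print("RMSE:", mean_squared_error(y_test, pred) ** 0.5)
print("R2  :", r2_score(y_test, pred))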


Results

Validation results for the Pseudomonas genus: MAE = 0.024, RMSE = 0.073, MSE = 0.005, R-Squared = 0.192

Validation results for the Mycobacteria genus: MAE = 0.001, RMSE= 0.0026, MSE = 6.85e-6, R-Squared = 0.326

Validation results for the Bacillus genus: MAE = 0.009, RMSE= 0.027, MSE = 0.0007, R-Squared = 0.033

For the Pseudomonas prediction, the MAE of 0.024 means the average absolute prediction error over all instances in the test set is 0.024; the RMSE of 0.073 is the standard deviation of the prediction errors; the MSE of 0.005 is the average of the squared errors; and the R-Squared of 0.192 means that 19.2% of the variance in Pseudomonas abundance is explained by the predictors (independent variables). For the Mycobacterium prediction, the MAE is 0.001, the RMSE is 0.0026, the MSE is 6.85e-6, and the R-Squared of 0.326 means that 32.6% of the variance in Mycobacterium abundance is explained by the predictors. For the Bacillus prediction, the MAE is 0.009, the RMSE is 0.027, the MSE is 0.0007, and the R-Squared of 0.033 means that only 3.3% of the variance in Bacillus abundance is explained by the predictors.


Neural Networks



Neural networks, also known as artificial neural networks (ANNs) or simulated neural networks (SNNs), are a subset of machine learning and are at the heart of deep learning algorithms (LeCun et al., 2015). Their name and structure are inspired by the human brain, mimicking the way biological neurons send signals to one another. An artificial neural network consists of node layers: an input layer, one or more hidden layers, and an output layer. Each node, also called an artificial neuron, is connected to other nodes with associated weights and thresholds. If the output of an individual node is above the specified threshold, that node is activated and sends its data to the next layer of the network; otherwise, the data is not passed on.


Formulas

We used a binary cross-entropy error function:

L = −(1/N)·Σᵢ [yᵢ·log(p(yᵢ)) + (1 − yᵢ)·log(1 − p(yᵢ))]

where p(yᵢ) is the predicted relative abundance for point i and the sum runs over all N points.

Adam optimizer update (Kingma & Ba, 2015):

wₜ₊₁ = wₜ − α·(∂L/∂wₜ) / (√vₜ + ϵ)

  • wₜ₊₁ = the updated weight after the time step
  • wₜ = the weight at time t
  • vₜ = an exponentially decaying sum of the squares of past gradients
  • ∂L/∂wₜ = derivative of the loss function with respect to the weight at time t
  • ϵ = a small positive constant to avoid division by zero when vₜ → 0
  • β₁ & β₂ = decay rates of the averages of the gradients and squared gradients
  • α = step size parameter / learning rate

Implementation

We used Python and TensorFlow to build a network of 7 dense layers with one dropout layer between every two dense layers. We used binary cross-entropy as our loss function and Adam as our optimizer.
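The layer widths and dropout rates in the sketch below are illustrative assumptions; it shows the general architecture described above (dense layers interleaved with dropout, binary cross-entropy loss, Adam optimizer) on synthetic placeholder data, not our trained model.

# Illustrative sketch: a dense network with interleaved dropout, trained with Adam
# on a binary cross-entropy loss.
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 13)).astype("float32")         # placeholder soil parameters
y = rng.integers(0, 2, size=(1000, 1)).astype("float32")  # placeholder presence labels

model = tf.keras.Sequential([
    tf.keras.Input(shape=(13,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation="sigmoid"),        # 7 dense layers in total
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=["accuracy"])
model.fit(X, y, validation_split=0.2, epochs=10, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))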


Results

Using our 1,561 validation samples to test the accuracy of our model, we reached a final prediction accuracy of 71%.


GEM Models



Out of our three predictive tools, our genome-scale metabolic models are the only ones not based on our data collection. These models are stoichiometric matrices that map all of the possible metabolites present in an environment to the set of all reactions occurring in a cell (Orth 2010). We use Flux Balance Analysis (FBA) to predict unique growth rates of bacterial species given controllable media conditions. In order to run FBA, we make steady-state and optimality assumptions. Steady state assumes that the metabolite concentrations are constant, which allows us to set the input flux of each metabolite equal to its output flux (Orth 2010). Optimality assumes that the primary objective of the bacterium is to produce biomass (Orth 2010). We know that bacteria can enter growth-arrested states where this is not the case, but for predicting growth we deemed this assumption acceptable.
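Formally, these two assumptions define the linear program that FBA solves (Orth 2010), where S is the stoichiometric matrix, v is the vector of reaction fluxes, and c is a vector selecting the biomass reaction:

maximize Z = cᵀ·v (the biomass objective)
subject to S·v = 0 (steady state: no net accumulation of internal metabolites)
and lbᵢ ≤ vᵢ ≤ ubᵢ (flux bounds, including media uptake limits)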


Flux Balance Analysis is our primary application of GEMs. It outputs a rate of bacterial growth that is used as a prediction of the survivability of that chassis in the given conditions. This technique uses a core biomass function as its objective function: it groups all of the reactions necessary for biomass production into one matrix and then, using linear optimization, maximizes the rate at which biomass is produced (Feist 2010). Using the COBRA toolbox (Cobra Core Team, 2019), we can switch the GEM being optimized (changing between species) and set the uptake rates and bounds of certain metabolites (simulating different environmental conditions). This allows us to simulate how a cell's metabolism will behave in suboptimal conditions. The design process for FBA is as follows:



In our predictive model we took users' desired input concentrations and converted them to uptake rates that were fed into our FBA. We made two core assumptions. First, all uptake rates are described by the Michaelis-Menten kinetics equation, which relates the uptake of a metabolite to its concentration outside the cell; we used crude estimates of Vmax and the Michaelis constant for each metabolite. Second, organic carbon content is directly proportional to glucose levels. With these assumptions in place, we took all concentrations input by the user, converted them to fluxes, and fed them into the FBA. We observed that organic carbon and oxygen concentrations were the most important parameters in determining the cell's growth rate, as most other nutrients could be derived from other sources. All metabolite concentrations not defined by our model were left at the default values set by the given GEM.
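A rough COBRApy sketch of this pipeline is shown below. The bundled "textbook" E. coli core model, the Michaelis-Menten constants, and the glucose concentration are placeholders rather than the exact GEMs and estimates we used.

# Illustrative sketch: converting a user-supplied glucose concentration to an uptake
# flux via Michaelis-Menten kinetics, then running FBA with COBRApy.
from cobra.io import load_model

def michaelis_menten(concentration, v_max, k_m):
    """Uptake rate as a function of external metabolite concentration."""
    return v_max * concentration / (k_m + concentration)

model = load_model("textbook")                    # bundled E. coli core model; any GEM works

# Placeholder kinetic constants and user input (mM and mmol/gDW/h are assumed units)
glucose_conc = 5.0
uptake_flux = michaelis_menten(glucose_conc, v_max=10.0, k_m=0.5)

# Constrain the glucose exchange reaction to the computed uptake rate
# (a negative lower bound means uptake in the COBRA convention)
glc_exchange = model.reactions.get_by_id("EX_glc__D_e")
glc_exchange.lower_bound = -uptake_flux

solution = model.optimize()                       # maximize the biomass objective (FBA)
print("predicted growth rate:", solution.objective_value)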


However, the primary results of our GEM work came from our collaboration with Gaston Day School (GDS). We were asked to model their E. coli circuit as well as whether L-Phenylalanine was a limiting factor in the production of cinnamaldehyde. We also wanted to provide them with interactive software to simulate these limiting factors under various media conditions. We chose GEMs as our model for GDS because we could add their engineered reactions and use Flux Balance Analysis (FBA) to predict growth under limited media conditions. The primary output was a Google Colaboratory notebook that allows GDS students to change media conditions, add and delete reactions, and test growth rates of any bacterium with a GEM. See the general code on our Results page.


References



  • Abdi, H., & Williams, L. J. (2010). Principal component analysis. Wiley interdisciplinary reviews: computational statistics, 2(4), 433-459.
  • Boston University School of Public Health. (2016). Simple Linear Regression. Simple linear regression. Retrieved October 11, 2022, from https://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/R/R5_Correlation-Regression/R5_Correlation-Regression4.html
  • Cobrapy Core Team. (2019). Documentation for COBRApy. Documentation for COBRApy - cobra 0.25.0 documentation. Retrieved October 11, 2022, from https://cobrapy.readthedocs.io/en/latest/
  • Cover, T.M., Hart, P.E. Nearest neighbor pattern classification. IEEE Trans. Inform. Theory, IT-13(1):21–27, 1967
  • Dash, S. K. (2022, August 5). Linear discriminant analysis: What is linear discriminant analysis. Analytics Vidhya. Retrieved October 11, 2022, from https://www.analyticsvidhya.com/blog/2021/08/a-brief-introduction-to-linear-discriminant-analysis/
  • Feist, A. M., & Palsson, B. O. (2010). The biomass objective function. Current opinion in microbiology, 13(3), 344-349.
  • Ho, Tin Kam (1995). Random Decision Forests (PDF). Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, 14–16 August 1995. pp. 278–282.
  • Kingma, diederik, & Ba, Jimmy (2015). Adam: A Method for Stochastic Optimization. 3rd International Conference for Learning Representations, San Diego, 2015.
  • Klusowski, J. M. (2020, August 13). Analyzing cart. arXiv.org. Retrieved October 11, 2022, from https://arxiv.org/abs/1906.10086
  • LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436–444.
  • Menon, A. (2018, September 19). Linear regression using gradient descent. Medium. Retrieved October 11, 2022, from https://towardsdatascience.com/linear-regression-using-gradient-descent-97a6c8700931
  • NIST Computer Security Resource Center (n.d.). Statistically independent events - glossary: CSRC. CSRC Content Editor. Retrieved October 11, 2022, from https://csrc.nist.gov/glossary/term/statistically_independent_events#:~:text=Two%20events%20are%20independent%20if,%2C%20P(A%20and%20B)
  • Opitz, D.; Maclin, R. (1999). "Popular ensemble methods: An empirical study". Journal of Artificial Intelligence Research. 11: 169–198. doi:10.1613/jair.614
  • Orth, J. D., Thiele, I., & Palsson, B. Ø. (2010). What is Flux Balance Analysis? Nature Biotechnology, 28(3), 245–248. https://doi.org/10.1038/nbt.1614
  • PSU. (2018). 5.4 - a matrix formulation of the multiple regression model. 5.4 - A Matrix Formulation of the Multiple Regression Model | STAT 462. Retrieved October 11, 2022, from https://online.stat.psu.edu/stat462/node/132/
  • Willmott, Cort J.; Matsuura, Kenji (December 19, 2005). "Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance". Climate Research. 30: 79–82.
  • Yang, K., Tu, J., & Chen, T. (2019). Homoscedasticity: An overlooked critical assumption for linear regression. General psychiatry, 32(5).
  • Zhang, Z., & Sabuncu, M. (2018). Generalized cross entropy loss for training deep neural networks with noisy labels. Advances in neural information processing systems, 31.