Model | GYHS - iGEM 2022

Predicting bacterial concentration from experimental data

Method 1: Single-channel linear regression

First, we consider only the relationship between the bacterial concentration and the blue (B) channel value measured in the paper-based detection.

We take log C (the logarithm of the concentration C) as the x axis and the B value as the y axis, fit a linear regression to the experimental observations, and obtain a linear function for each pathogen. We chose the B channel because it shows the clearest linear trend of the three RGB channels.

B-concentration regression for EC

B-concentration regression for VP

B-concentration regression for SA

B-concentration regression for SE
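The single-channel fit can be sketched in a few lines. This is a minimal illustration and not the team's script: the data points are made up, the base-10 logarithm is an assumption, and `predict_concentration` is a hypothetical helper that inverts the fitted line.

```python
import numpy as np

# Made-up illustration data (NOT the team's measurements):
# concentrations in CFU/mL and the B value read at each concentration.
concentration = np.array([1e3, 1e4, 1e5, 1e6, 1e7])
b_value = np.array([200.0, 185.0, 170.0, 155.0, 140.0])

# Assumption: base-10 logarithm of the concentration on the x axis.
x = np.log10(concentration)

# Degree-1 least-squares fit: b_value ~ slope * x + intercept.
slope, intercept = np.polyfit(x, b_value, deg=1)


def predict_concentration(b):
    """Invert the fitted line to estimate a concentration from a B value."""
    return 10 ** ((b - intercept) / slope)
```

Note that a zero-concentration control cannot be included here, because the logarithm of zero is undefined.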

Method 2: Multi-channel linear regression

Single-channel linear regression uses only a small part of the data and shows considerable deviation. To make full use of the measured RGB values, we perform linear regression on all three channels and obtain a linear relationship between log(C+1) and the RGB values. We chose log(C+1) as the z axis instead of log C because the single-channel regression did not consider the control group, whose concentration is zero and for which log C is undefined. Using log(C+1) instead of log C keeps the zero point fixed after the transformation, since log(0+1) = 0, while sacrificing only a little precision elsewhere.

Since the three RGB values and the concentration cannot all be shown in one figure, we used principal component analysis (PCA) to project the RGB values onto the XY plane.

PCA can be interpreted as follows: imagine observing a three-dimensional object from a direction chosen to capture as much information as possible. From a suitable viewing angle, even though the image you get is two-dimensional, you can still gain a thorough understanding of what the object looks like. PCA is a deterministic mathematical operation that performs such a projection from higher dimensions to lower dimensions, and that is how we transform the RGB values into XY coordinates.

A 3D figure and its 2D projection
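The projection described above can be sketched with a singular value decomposition. This is a minimal illustration under stated assumptions: the RGB readings are made up, and `pca_project` is a hypothetical helper, not necessarily the routine used to produce the figures.

```python
import numpy as np


def pca_project(rgb, n_components=2):
    """Project the rows of `rgb` onto their first principal components."""
    rgb = np.asarray(rgb, dtype=float)
    centered = rgb - rgb.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return centered @ vt[:n_components].T, explained[:n_components]


# Made-up RGB readings, for illustration only.
rgb_samples = np.array([
    [210, 200, 190],
    [190, 185, 200],
    [170, 165, 215],
    [150, 150, 225],
    [130, 140, 240],
])
xy, explained_ratio = pca_project(rgb_samples)
```

Each row of `xy` is the 2D position of one color, and `explained_ratio` indicates how much of the color variation each of the two axes retains.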

The results of the multi-channel linear regression are listed below:

EC: log(C+1) = 0.02240R - 0.00783G + 0.09002B - 5.92404

VP: log(C+1) = 0.02308R - 0.01599G - 0.13322B + 26.49185

SA: log(C+1) = 0.16697R - 0.22322G - 0.07831B + 33.72686

SE: log(C+1) = 0.08613R + 0.02074G - 0.26186B + 36.32994

With multi-channel color data, the regression predicts the concentration more accurately than single-channel linear regression does.
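A fit of this form can be sketched as an ordinary least-squares problem. The data below are made up, the base-10 logarithm is an assumption, and `predict_concentration` is a hypothetical helper; the coefficients it produces are unrelated to the ones listed above.

```python
import numpy as np

# Made-up illustration data: RGB readings and concentrations (CFU/mL),
# including a zero-concentration control in the first row.
rgb = np.array([
    [220.0, 210.0, 200.0],
    [205.0, 190.0, 210.0],
    [190.0, 185.0, 215.0],
    [180.0, 160.0, 230.0],
    [160.0, 150.0, 235.0],
    [150.0, 120.0, 250.0],
])
concentration = np.array([0.0, 1e3, 1e4, 1e5, 1e6, 1e7])

# Assumption: base-10 logarithm. log(C + 1) maps the control (C = 0) to 0.
z = np.log10(concentration + 1.0)

# Design matrix [R, G, B, 1]; the trailing column of ones gives the intercept.
design = np.column_stack([rgb, np.ones(len(rgb))])
coef, *_ = np.linalg.lstsq(design, z, rcond=None)


def predict_concentration(r, g, b):
    """Evaluate the fitted plane and invert log(C + 1)."""
    z_hat = coef @ np.array([r, g, b, 1.0])
    return 10 ** z_hat - 1.0
```

The four entries of `coef` play the same role as the R, G, B coefficients and the constant term in the equations above.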

Method 3: Gaussian Process regression

Although multi-channel linear regression produces more reliable results, it still has a flaw: it cannot tell us the uncertainty of a prediction. A powerful mathematical tool, the Gaussian Process, solves this problem.

If you are not familiar with Gaussian Process regression, here is an intuitive explanation of it:

We assume that the data are related and that the whole input space follows the same Gaussian distribution. Consider the RGB values measured in the paper-based experiments at different concentrations as a set of inputs, and “insert” them into a large Gaussian Process, which can be thought of as holding infinitely many data points drawn from a prior distribution. The core of a Gaussian Process is the pre-selected kernel function, which gives the covariance (or covariance matrix) between inputs. The covariance reflects how strongly the inputs are correlated.

The linear regression results show that the relationship is linear, so we do not need an arbitrary smooth function for the fit. Such a function would over-fit: the model would follow the provided data too closely and lose reliability when predicting other data points. We therefore chose the dot product kernel, which yields a family of generalized linear functions. By adjusting the parameters of the kernel (such as σ₀), the Gaussian Process can predict different concentrations more accurately.

A great strength of the Gaussian Process is that it predicts the mean and the variance at the same time, based on the “distance” between the input to be predicted and the previous samples. When the input is close to a sample (for example, predicting at 10⁶ CFU/mL when 1.1×10⁶ CFU/mL is already known), the prediction draws more information from the nearby samples; otherwise it has less information and is less accurate.
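To make the intuition concrete, the posterior mean and standard deviation of a zero-mean Gaussian Process with a dot product kernel can be written directly in NumPy. This is a sketch under stated assumptions, not the team's code: the kernel form k(a, b) = σ₀² + a·b, the parameter values, the made-up data, and the scaling of the inputs to roughly [0, 1] are all chosen for illustration, and `gp_predict` is a hypothetical helper.

```python
import numpy as np


def gp_predict(x_train, y_train, x_new, sigma_0=1.0, noise=0.1):
    """Zero-mean GP posterior with kernel k(a, b) = sigma_0**2 + a . b."""
    def kernel(a, b):
        return sigma_0**2 + a @ b.T

    # Training covariance with observation noise added on the diagonal.
    k_train = kernel(x_train, x_train) + noise**2 * np.eye(len(x_train))
    k_cross = kernel(x_train, x_new)                    # (n_train, n_new)
    k_new_diag = sigma_0**2 + np.sum(x_new**2, axis=1)  # prior variance at x_new

    # Solve with a Cholesky factor instead of forming the inverse.
    chol = np.linalg.cholesky(k_train)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    v = np.linalg.solve(chol, k_cross)

    mean = k_cross.T @ alpha
    # Variance of the underlying function (observation noise not included).
    var = k_new_diag - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))


# Made-up data: a noiseless linear target on inputs scaled to about [0, 1].
rng = np.random.default_rng(0)
weights = np.array([2.0, -1.0, 3.0])
x_train = rng.uniform(0.2, 0.8, size=(12, 3))
y_train = x_train @ weights + 1.0

x_near = x_train[:1] + 0.01          # close to a known sample
x_far = np.array([[5.0, 5.0, 5.0]])  # far from every sample
mean, std = gp_predict(x_train, y_train, np.vstack([x_near, x_far]))
```

Comparing `std[0]` with `std[1]` shows the behaviour described above: the query close to a known sample gets a much smaller standard deviation than the query far from all samples.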

Here is how we perform Gaussian Process regression in three major steps:

① Assume that the sample data follow the empirical rule that similar inputs give similar outputs. Explanation: the color (R, G, B) results from hydrolysis of the chromogenic substrate. If the bacterial concentration changes slightly, the enzyme concentration changes slightly, so the degree of hydrolysis also changes slightly. We therefore consider that the data follow this rule, which means we can use the covariance in the Gaussian Process to express the relationship and then predict the result.

② Predict the result with the Gaussian Process: We choose an appropriate kernel composition (a dot product kernel, k(x, x') = σ₀² + x·x', plus a white noise kernel, because experimental results inevitably contain deviation) and use GaussianProcessRegressor from the sklearn.gaussian_process module (with a prior mean of 0) to fit the existing data. With the Gaussian Process, we predicted the relationship between a given RGB value (input) and the concentration (output), and obtained the upper and lower bounds of the 95% confidence interval.

Color PC1 and Color PC2 are the R, G, B values transformed by PCA.

The red points are the experimental results, and the orange point is the color provided by the user (it does not influence the Gaussian Process). The image is a screenshot from a website we built, which is introduced at the bottom of this document.

Gaussian Process Regression for EC

Gaussian Process Regression for VP

Gaussian Process Regression for SA

Gaussian Process Regression for SE
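Step ② can be sketched with scikit-learn as follows. The training data are made up, the RGB values are scaled to [0, 1] as an assumption (the original scaling is not stated), and the initial kernel parameters are arbitrary starting points that the optimizer then adjusts.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct, WhiteKernel

# Made-up training data: scaled RGB inputs and noisy log(C + 1) targets.
rng = np.random.default_rng(0)
rgb_train = rng.uniform(0.2, 0.8, size=(15, 3))
log_c_train = (rgb_train @ np.array([2.0, -1.0, 3.0]) + 1.0
               + rng.normal(0.0, 0.05, size=15))

# Dot product kernel for the linear trend, white noise for measurement scatter.
kernel = DotProduct(sigma_0=1.0) + WhiteKernel(noise_level=0.1)

# normalize_y=False keeps the prior mean at 0.
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=False, random_state=0)
gpr.fit(rgb_train, log_c_train)

# Hypothetical user-picked color (scaled RGB).
rgb_query = np.array([[0.5, 0.4, 0.6]])
mean, std = gpr.predict(rgb_query, return_std=True)

# 95% interval of a Gaussian prediction: mean +/- 1.96 standard deviations.
lower, upper = mean - 1.96 * std, mean + 1.96 * std
```

After fitting, `gpr.kernel_` holds the optimized kernel, including the fitted `sigma_0` and noise level.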

③ Prediction over a larger range: To predict over a larger range, we use the Gaussian Process to predict the mean concentration and the standard deviation for every color in the whole RGB space, and plot the results in three-dimensional coordinates according to the RGB values. The color of each point (a specific RGB value) indicates the mean concentration in the left picture and the standard deviation in the right picture.

RGB space for EC

When the color of the paper base is closer to	the mean becomes	the standard deviation becomes	the prediction becomes more
red	higher	lower	reliable
yellow	lower	lower	reliable
green	lower	higher	unreliable
blue	higher	lower	reliable

RGB space for VP

When the color of the paper base is closer to	the mean becomes	the standard deviation becomes	the prediction becomes more
blue	higher	lower	reliable
black	higher	lower	reliable
red	higher	higher	unreliable
green	higher	lower	reliable

RGB space for SA

When the color of the paper base is closer to	the mean becomes	the standard deviation becomes	the prediction becomes more
yellow	higher	lower	reliable
blue	higher	lower	reliable
red	higher	higher	unreliable
green	lower	higher	unreliable

RGB space for SE

When the color of the paper base is closer to	the mean becomes	the standard deviation becomes	the prediction becomes more
purple	higher	lower	reliable
blue	higher	lower	reliable
yellow	higher	higher	unreliable
red	lower	higher	unreliable
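The whole-RGB-space prediction of step ③ can be sketched by evaluating a fitted model on a grid of colors. As before, this is an illustration under assumptions: the training data are made up, RGB is scaled to [0, 1], and the grid is deliberately coarse (9 levels per channel) to keep the example fast.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct, WhiteKernel

# Made-up training data: scaled RGB inputs and noisy log(C + 1) targets.
rng = np.random.default_rng(0)
rgb_train = rng.uniform(0.2, 0.8, size=(15, 3))
log_c_train = (rgb_train @ np.array([2.0, -1.0, 3.0]) + 1.0
               + rng.normal(0.0, 0.05, size=15))

kernel = DotProduct(sigma_0=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, random_state=0)
gpr.fit(rgb_train, log_c_train)

# Coarse grid over the RGB cube: 9 levels per channel, 729 colors in total.
levels = np.linspace(0.0, 1.0, 9)
r, g, b = np.meshgrid(levels, levels, levels, indexing="ij")
grid = np.column_stack([r.ravel(), g.ravel(), b.ravel()])

# One mean and one standard deviation per grid color.
mean, std = gpr.predict(grid, return_std=True)

# Colors where the model is most and least certain.
most_certain = grid[np.argmin(std)]
least_certain = grid[np.argmax(std)]
```

Coloring a 3D scatter of `grid` by `mean` in one panel and by `std` in another gives a pair of pictures like the ones summarized in the tables above.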

We built a website that lets readers explore the model directly. Because it is written in Python and relies on a Python backend, we cannot embed it directly in our Wiki. Instead we provide a link: you can check it out here to see the results. (The website is described in more detail below.)

After comparing the different modeling methods, we chose Gaussian Process regression as our final model. Compared with single- or multi-channel linear regression, it makes deeper use of the experimental data. Although it appears to fit the given data points less closely than multi-channel linear regression does, it also predicts the standard deviation, which helps us determine whether a prediction is reliable and should be taken into consideration. Being able to predict both the mean and the standard deviation gives the model broader usability, because it provides a prediction for any possible case. In this way, the predicted concentration can be more accurate and offers more evidence for future study.

About the website

At the top of the website, you can select the pathogen you want to detect. You can then adjust the confidence level of the interval with the slider.

Next, pick the color you see in the paper-based experiment. After a short calculation, the website returns the predicted concentration, as shown in the image. (The color you pick appears as an orange point in the figures.)
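How a slider value could translate into an interval can be sketched as follows. This is an assumption about the calculation, not a description of the website's actual code: it treats the prediction as Gaussian in log(C+1), assumes a base-10 logarithm, and uses the hypothetical helpers `interval` and `to_concentration`.

```python
from statistics import NormalDist


def interval(mean, std, level=0.95):
    """Two-sided interval mean +/- z * std for a Gaussian prediction."""
    if not 0.0 < level < 1.0:
        raise ValueError("level must be strictly between 0 and 1")
    # z is the standard normal quantile that leaves (1 - level) / 2 in each tail.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return mean - z * std, mean + z * std


def to_concentration(log_value):
    """Invert log(C + 1), assuming a base-10 logarithm."""
    return 10 ** log_value - 1.0


# Hypothetical prediction in log(C + 1) units.
low, high = interval(5.0, 0.5, level=0.95)
c_low, c_high = to_concentration(low), to_concentration(high)
```

Raising the level widens the interval, so a higher slider value gives a wider range of plausible concentrations for the same picked color.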

You can expand the “Raw data” section to check the data from our experiments, which serve as the training dataset for Gaussian Process regression.

Below the “Raw data” section is the “Gaussian process regression” section. Click the button and wait a moment to see the same results that we introduced above.

Hold the left mouse button and drag to view the 3D figure from different angles. The orange point is the input color to be predicted. From the location of the orange point, you can judge the approximate mean and standard deviation.