Predicting concentration of bacteria based on experiment data
Method 1: Single channel linear regression
First, we consider only the relationship between the concentration and the Blue channel (B) value
detected on the paper-based test.
We take log C as the x axis and the B value as the y axis, perform linear regression on the
experimental observations, and obtain a fitted linear function. We chose the B value for the
regression because it shows the clearest linear trend among the three RGB channels.
B-concentration regression for EC
B-concentration regression for VP
B-concentration regression for SA
B-concentration regression for SE
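As a minimal sketch of such a single-channel fit, the snippet below regresses the B value on the logarithm of the concentration with NumPy. The data points are made-up placeholders rather than our measurements, and both the base-10 logarithm and the helper name `predict_log_concentration` are assumptions made for illustration.

```python
import numpy as np

# Hypothetical placeholder data (NOT the experimental measurements).
concentration = np.array([1e2, 1e3, 1e4, 1e5, 1e6])   # CFU/mL
blue = np.array([200.0, 185.0, 171.0, 154.0, 140.0])  # B channel, 0-255

# x axis: log concentration (base 10 is an assumption); y axis: B value.
x = np.log10(concentration)
slope, intercept = np.polyfit(x, blue, deg=1)

def predict_log_concentration(b_value):
    """Invert the fitted line B = slope * log10(C) + intercept."""
    return (b_value - intercept) / slope

# Coefficient of determination of the fit.
residuals = blue - (slope * x + intercept)
r_squared = 1.0 - residuals.var() / blue.var()

print(f"B = {slope:.3f} * log10(C) + {intercept:.3f}, R^2 = {r_squared:.4f}")
print(f"B = 160 -> log10(C) = {predict_log_concentration(160.0):.2f}")
```

Inverting the fitted line, as `predict_log_concentration` does, is one simple way to turn a measured B value back into a concentration estimate.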
Method 2: Multi channel linear regression
Single-channel linear regression uses only a small part of the data and shows considerable
deviation. To make full use of the known RGB values, we perform linear regression on multi-channel
data and obtain a linear relationship between log(C+1) and the RGB values. We chose log(C+1) as the
z axis instead of log C because the single-channel regression did not consider the control group,
whose concentration is zero. Using log(C+1) instead of log C keeps the zero point fixed under the
transformation (log(0+1) = 0) while sacrificing only a little precision elsewhere.
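The effect of shifting the concentration by one before taking the logarithm can be checked with a few lines of Python. This is only an illustration: the base-10 logarithm and the function names are assumptions, and the concentrations are arbitrary example values.

```python
import math

def log_concentration_plus_one(c):
    """Regression target log(C + 1); base 10 is an assumption."""
    return math.log10(c + 1.0)

def inverse_transform(z):
    """Recover C from z = log10(C + 1)."""
    return 10.0 ** z - 1.0

for c in (0.0, 1e2, 1e6):
    z = log_concentration_plus_one(c)
    # log10(C) is undefined for the control group (C = 0).
    shift = z - math.log10(c) if c > 0 else float("nan")
    print(f"C = {c:9.0f}  log10(C+1) = {z:.6f}  difference from log10(C) = {shift:.2e}")
```

The control group maps exactly to zero, while at high concentrations the shifted and unshifted logarithms differ only marginally.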
Since RGB values and concentrations cannot be plotted in a single figure (that would need four
dimensions), we used principal component analysis (PCA) to project the RGB values onto the XY plane.
PCA can be understood as follows: imagine observing a three-dimensional object from a direction
chosen to capture as much information as possible. From a suitable viewing angle, even though the
image you get is two-dimensional, you can still get a thorough understanding of what the object
looks like. PCA is a deterministic mathematical operation that performs such a projection from
higher to lower dimensions, and that is how we transform RGB values into XY coordinates.
A 3D figure
Its 2D projection
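The projection described above can be sketched with a singular value decomposition in NumPy. This is a generic PCA sketch, not necessarily the implementation behind our figures; the RGB readings are placeholders and `pca_project` is a hypothetical helper name.

```python
import numpy as np

def pca_project(rgb, n_components=2):
    """Project the rows of `rgb` onto their top principal components."""
    centered = rgb - rgb.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return centered @ vt[:n_components].T, explained[:n_components]

# Hypothetical RGB readings (placeholders, not the measured colors).
rgb = np.array([
    [210.0, 190.0, 120.0],
    [200.0, 185.0, 135.0],
    [188.0, 181.0, 150.0],
    [176.0, 176.0, 166.0],
    [165.0, 172.0, 180.0],
])

xy, explained_ratio = pca_project(rgb)
print(xy.round(2))
print("fraction of variance kept by PC1 and PC2:", explained_ratio.sum().round(4))
```

The explained-variance fraction indicates how much information survives the projection from three dimensions to two.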
The results of the multi-channel linear regression are listed below:
EC: log(C+1) = 0.02240R - 0.00783G + 0.09002B - 5.92404
VP: log(C+1) = 0.02308R - 0.01599G - 0.13322B + 26.49185
SA: log(C+1) = 0.16697R - 0.22322G - 0.07831B + 33.72686
SE: log(C+1) = 0.08613R + 0.02074G - 0.26186B + 36.32994
With multi-channel color data, the prediction is more accurate than that of single-channel linear
regression.
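To show how the fitted equations above can be applied, here is a small sketch that evaluates them for a given color. The coefficients are copied from the equations; the example color, the function names, and the base-10 logarithm used for the inversion are assumptions, since the base is not stated here.

```python
import math

# Coefficients (R, G, B, intercept) copied from the fitted equations above.
COEFFICIENTS = {
    "EC": (0.02240, -0.00783, 0.09002, -5.92404),
    "VP": (0.02308, -0.01599, -0.13322, 26.49185),
    "SA": (0.16697, -0.22322, -0.07831, 33.72686),
    "SE": (0.08613, 0.02074, -0.26186, 36.32994),
}

def predict_log_c_plus_one(pathogen, r, g, b):
    """Evaluate log(C+1) = a*R + b*G + c*B + d for one pathogen."""
    a_r, a_g, a_b, intercept = COEFFICIENTS[pathogen]
    return a_r * r + a_g * g + a_b * b + intercept

def to_concentration(z, base=10.0):
    """Invert z = log(C+1); the logarithm base is an assumption."""
    return max(base ** z - 1.0, 0.0)

# Hypothetical color reading, chosen only for illustration.
z = predict_log_c_plus_one("EC", r=180, g=170, b=100)
print(f"log(C+1) = {z:.4f}, C = {to_concentration(z):.3g} CFU/mL (if base 10)")
```

Clamping at zero in `to_concentration` is a design choice of this sketch, so that colors outside the fitted range never yield a negative concentration.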
Method 3: Gaussian Process regression
Although multi-channel linear regression produces more reliable results, it still has a flaw: it
cannot tell us the uncertainty of a prediction. A powerful mathematical tool, the Gaussian Process,
can solve this problem.
If you are not familiar with Gaussian Process regression, here is an intuitive explanation:
We assume that the data points are related and that the whole input space follows one joint Gaussian
distribution. Consider the RGB values measured in the paper-based experiments at different
concentrations as a set of inputs, and “insert” them into a large Gaussian Process, which can be
thought of as holding infinitely many data points drawn from the prior distribution. The core of a
Gaussian Process is the pre-selected kernel function, which gives the covariance (or covariance
matrix) between inputs. Covariance reflects the correlation between input data.
The linear regression results show that a linear relationship holds, so we do not need an arbitrary
smooth function for the fit. Using one would lead to over-fitting, which means that the model
follows the provided data too closely and loses reliability when predicting other data points.
Thus, we chose the dot product kernel, which yields a family of generalized linear functions. By
adjusting the kernel's parameters, the Gaussian Process can predict cases of different concentration
more accurately.
A great strength of the Gaussian Process is that it predicts the mean and the variance at the same
time, based on the “distance” between the input to be predicted and the previous samples. When the
input is close to a sample (for example, predicting at 10⁶ CFU/mL when 1.1×10⁶ CFU/mL is already
known), the prediction draws more information from the nearby samples. Otherwise it has less
information and is less accurate.
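The behavior described above, a mean and an uncertainty predicted together, can be illustrated with scikit-learn. This is a minimal sketch under stated assumptions: the training data are placeholders, the inputs are scaled to the range 0 to 1, and the values of `sigma_0` and `alpha` are arbitrary starting choices rather than the settings used for our results.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct

# Hypothetical training data: RGB colors and log(C+1) values (placeholders).
rgb_train = np.array([
    [210.0, 190.0, 120.0],
    [200.0, 185.0, 135.0],
    [188.0, 181.0, 150.0],
    [176.0, 176.0, 166.0],
    [165.0, 172.0, 180.0],
])
log_c_train = np.array([0.0, 2.1, 3.9, 5.2, 6.0])

# The dot product kernel restricts the fit to linear functions of the input.
# alpha adds a small noise term to the kernel diagonal (assumed value).
gp = GaussianProcessRegressor(kernel=DotProduct(sigma_0=1.0), alpha=1e-2,
                              random_state=0)
gp.fit(rgb_train / 255.0, log_c_train)

# One query close to a training color, one far away from all of them.
queries = np.array([[189.0, 181.0, 149.0], [20.0, 240.0, 30.0]]) / 255.0
mean, std = gp.predict(queries, return_std=True)
for label, m, s in zip(("near", "far"), mean, std):
    print(f"{label}: mean = {m:.2f}, std = {s:.2f}")
```

The query far from every training color should come back with the larger standard deviation, which is the signal that its predicted mean deserves less trust.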
Here is how we perform Gaussian Process regression in three major steps:
- ①Assume that the sample data follow the empirical rule that similar inputs give similar
outputs.
- Explanation: The color (R,G,B) results from hydrolysis of the chromogenic substrate. If the
concentration of bacteria changes slightly, the concentration of enzyme changes slightly, and so
does the degree of hydrolysis. We therefore consider that the data follow this rule, which means
we can use the covariance in the Gaussian Process to describe the relationship and then predict
the result.
- ②Predict the result by Gaussian Process
- We choose an appropriate kernel composition (a dot product kernel combined with a white noise
kernel, since experimental results inevitably contain deviation) and use Gaussian Process
regression from the sklearn.gaussian_process module (with a prior mean of 0) to fit the existing
data. With the Gaussian Process, we successfully predicted the relationship between a given RGB
value (input) and the concentration (output), and obtained the upper and lower bounds of the 95%
confidence interval.
Color PC1 and Color PC2 are the R, G, B values transformed by PCA.
The red points are the experimental results, and the orange point is the color provided by the user
(it does not influence the Gaussian Process). The image is a screenshot from a website we built,
which is introduced at the bottom of this document.
Gaussian Process Regression for EC
Gaussian Process Regression for VP
Gaussian Process Regression for SA
Gaussian Process Regression for SE
- ③Prediction over a larger range
- To predict over a larger range, we use the Gaussian Process to predict the mean concentration
and the standard deviation for every color in the whole RGB space, and plot the results in
three-dimensional coordinates according to the RGB values. The color of each point (a specific
RGB value) indicates the mean concentration in the left picture and the standard deviation in
the right picture.
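The three steps above can be sketched with the sklearn.gaussian_process module as follows. Several details are assumptions: the training data are placeholders, the two kernels are combined by addition, the initial hyperparameters are arbitrary, the inputs are scaled to the range 0 to 1, the grid is deliberately coarse, and `predict_with_interval` is a hypothetical helper name.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct, WhiteKernel

# Hypothetical training data (placeholders, not the measured values).
rgb_train = np.array([
    [210.0, 190.0, 120.0],
    [200.0, 185.0, 135.0],
    [188.0, 181.0, 150.0],
    [176.0, 176.0, 166.0],
    [165.0, 172.0, 180.0],
])
log_c_train = np.array([0.0, 2.1, 3.9, 5.2, 6.0])

# Step 2: dot product kernel with a white noise kernel (sum is an assumption).
# normalize_y=False keeps the prior mean at zero.
kernel = DotProduct(sigma_0=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=False, random_state=0)
gp.fit(rgb_train / 255.0, log_c_train)

def predict_with_interval(rgb, z=1.96):
    """Return mean, lower and upper bound; z = 1.96 gives about 95% for a Gaussian."""
    mean, std = gp.predict(np.atleast_2d(rgb) / 255.0, return_std=True)
    return mean, mean - z * std, mean + z * std

# Step 3: evaluate the model on a coarse grid covering the RGB cube.
levels = np.linspace(0.0, 255.0, 9)
grid = np.array(np.meshgrid(levels, levels, levels, indexing="ij")).reshape(3, -1).T
grid_mean, grid_std = gp.predict(grid / 255.0, return_std=True)

mean, lower, upper = predict_with_interval([188.0, 181.0, 150.0])
print(f"mean = {mean[0]:.2f}, 95% interval = [{lower[0]:.2f}, {upper[0]:.2f}]")
print("grid points:", grid.shape[0], "largest std:", grid_std.max().round(2))
```

The arrays `grid_mean` and `grid_std` hold one value per grid color, so they could be used to color two 3D scatter plots of the RGB cube, one for the mean and one for the standard deviation.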
RGB space for EC
| When the color of the paper base is closer to | the mean becomes | the standard deviation becomes | the prediction gets more |
| --- | --- | --- | --- |
| red | higher | lower | reliable |
| yellow | lower | lower | reliable |
| green | lower | higher | unreliable |
| blue | higher | lower | reliable |
RGB space for VP
| When the color of the paper base is closer to | the mean becomes | the standard deviation becomes | the prediction gets more |
| --- | --- | --- | --- |
| blue | higher | lower | reliable |
| black | higher | lower | reliable |
| red | higher | higher | unreliable |
| green | higher | lower | reliable |
RGB space for SA
| When the color of the paper base is closer to | the mean becomes | the standard deviation becomes | the prediction gets more |
| --- | --- | --- | --- |
| yellow | higher | lower | reliable |
| blue | higher | lower | reliable |
| red | higher | higher | unreliable |
| green | lower | higher | unreliable |
RGB space for SE
| When the color of the paper base is closer to | the mean becomes | the standard deviation becomes | the prediction gets more |
| --- | --- | --- | --- |
| purple | higher | lower | reliable |
| blue | higher | lower | reliable |
| yellow | higher | higher | unreliable |
| red | lower | higher | unreliable |
We built a website that lets readers explore the model directly. Because it is written in Python and
relies on a Python backend, we cannot embed it directly in our Wiki. Instead, we offer a link to the
website; you can check it out here to see the result. (A fuller introduction is below.)
After comparing the different modeling methods, we chose Gaussian Process regression as our final
model. Compared to single- or multi-channel linear regression, it makes deeper use of the
experimental data. While it appears to fit the given data points less accurately than multi-channel
linear regression, it also predicts the standard deviation, which helps us determine whether a
prediction is reliable and thus should be taken into consideration. Being able to predict both the
mean and the standard deviation gives the model broader usability, since it provides a prediction
for any possible case. As a result, the predicted concentration can be more accurate and offers more
evidence for future study.
About the website
At the top of the website, you can select the pathogen you are going to detect. Then you can adjust
the confidence interval by adjusting the slider value.
Next, pick the color you see on the paper-based test. After a short calculation, the website
returns the predicted concentration, as shown in the image. (The color you pick will appear as an
orange point in the figures.)
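How a chosen confidence level can translate into interval bounds is sketched below for a Gaussian prediction. This is a generic calculation and not necessarily what the website does internally; `confidence_bounds` is a hypothetical helper, and the mean and standard deviation are example numbers.

```python
from statistics import NormalDist

def confidence_bounds(mean, std, level=0.95):
    """Two-sided Gaussian interval containing `level` of the probability mass."""
    if not 0.0 < level < 1.0:
        raise ValueError("level must be strictly between 0 and 1")
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return mean - z * std, mean + z * std

# Hypothetical predictive mean and standard deviation of log(C+1).
for level in (0.80, 0.95, 0.99):
    lower, upper = confidence_bounds(5.0, 0.4, level)
    print(f"{level:.0%}: [{lower:.3f}, {upper:.3f}]")
```

A higher confidence level widens the interval around the same mean, which is the effect one would expect from moving the slider.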
You can expand the “Raw data” section to check the data we get from our experiments, which serves as
a training dataset for Gaussian Process regression.
Below the “Raw data” section is the “Gaussian process regression” section. Click the button and
wait a moment to see the same results introduced above.
Hold the left mouse button and drag to view the 3D figure from different angles. The orange point
is the input color to be predicted. From its location you can judge the approximate mean and
standard deviation.