Mathematical modelling
Part 1: Overview of Gaussian Process
In this project, we utilize the CRISPR-Cas system to cut dsDNA and produce fluorescence inside a cell-free system. Each component inside the cell-free system, including triggerRNA, dsDNA, igRNA, probe, and Cas12a, has its own unique impact on the final fluorescence produced.
The variables present influence the final fluorescence produced in different ways. To characterize the impact of each component on the final fluorescence, we use a Gaussian Process (GP) to model the relationship between the final fluorescence produced and the concentration of each component in the cell-free system. In previous experiments, we obtained data on the impact of these five components on the final fluorescence produced.
Here is a table showing the data obtained previously.
Table 1: The data we obtained (Note: the unit of each solution volume is microliters; fluorescence is the relative brightness of the tube obtained with the software ImageJ. In the note column, rows marked "blank" are blank control groups.)
Part 2: The introduction of Gaussian Process
A Gaussian Process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution. That is, it describes a probability distribution over functions $f(x)$, such that any finite set of function values $f(x_1), \dots, f(x_n)$ is jointly Gaussian. By definition, a Gaussian Process is specified by a mean function $m(x)$ and a covariance function $k(x, x')$: [1], [2]

$$f(x) \sim \mathcal{GP}\big(m(x),\, k(x, x')\big)$$
We could understand how a Gaussian Process works by first looking at the Gaussian distribution.
For a 1-D Gaussian distribution, we can construct a continuous function to represent its probability, termed the probability density function. As its name suggests, a single point on this function gives the probability density: the higher the density at a point, the more likely a sample is to fall near that point. Such a function can be expressed as follows:

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

And:

$$X \sim \mathcal{N}(\mu, \sigma^2)$$
The graph of this formula can be constructed as:
Graph 1: The 1-D Gaussian distribution
There are two parameters in this function, $\mu$ and $\sigma$, where $\mu$ is the mean value and $\sigma^2$ is the variance. The sample variance can be calculated by the following equation:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
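As a minimal numerical sketch of the two formulas above (the sample values below are purely hypothetical), both can be evaluated directly with numpy:

import numpy as np

def gaussian_pdf(x, mu, sigma):
    # p(x) = 1 / (sigma * sqrt(2*pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
    return np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

print(gaussian_pdf(0.0, mu=0.0, sigma=1.0))  # 0.3989..., the peak of the standard normal

samples = np.array([1.2, 0.8, 1.5, 0.9, 1.1])  # hypothetical measurements
print(np.var(samples, ddof=1))                 # sample variance with the 1/(n-1) correction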
We can then extend this model to higher dimensions. That is, we can build a model containing different variables and see how each variable influences the results. Each variable has its own $\mu$ and $\sigma$. At the same time, a new quantity, the covariance, indicates the dependency of one variable on another. It can be denoted as:

$$\operatorname{cov}(X, Y) = E\big[(X - \mu_X)(Y - \mu_Y)\big]$$
A positive covariance value indicates that two variables have a positive correlation. When they are plotted on a graph with one variable on the x-axis and the other on the y-axis, the resulting plot has the following shape:
Graph 2: The shape of 2 variables with positive covariance
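A quick sketch (using hypothetical simulated data) shows how a positive covariance arises when one variable depends positively on another:

import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(0, 1, 1000)
y = 0.8 * x + rng.normal(0, 0.5, 1000)  # y depends positively on x

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is cov(x, y)
print(np.cov(x, y)[0, 1])  # positive, close to 0.8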
In order to show a Gaussian distribution in a higher dimension, we have to adopt a different way of representing the data. If we plot the magnitude of each coordinate on the y-axis and the dimension index on the x-axis, we can display a higher-dimensional Gaussian distribution: each point's height on the vertical axis represents the sampled value in that dimension. For example, if one sample point in a 2-D Gaussian distribution is (2, 4), then with this means of presentation it is expressed by the points (1, 2) and (2, 4), given that the dimension index starts from 1; the first value denotes the dimension index and the second the corresponding value.
Using this method, we can present higher-dimensional Gaussian distributions on a 2-D plane. Shown in the graph is a sample from a 25-dimensional Gaussian distribution.
Graph 3: The representation of a sample from a Gaussian distribution in 25 dimensions
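The representation in Graph 3 can be reproduced with a short sketch: sample once from a 25-dimensional Gaussian (here with a zero mean and an identity covariance, chosen purely for illustration) and plot each dimension's value against its index:

import numpy as np
import matplotlib.pyplot as plt

dims = 25
sample = np.random.default_rng(0).multivariate_normal(
    mean=np.zeros(dims), cov=np.eye(dims)
)

# x-axis: dimension index; y-axis: the sampled value in that dimension
plt.scatter(np.arange(1, dims + 1), sample)
plt.xlabel("Dimension index")
plt.ylabel("Sampled value")
plt.show()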
If we understand the relationship between the distance between dimensions and the covariance, we can fix the values of several dimensions and obtain the predicted distribution in the remaining dimensions.
In the following graph's example, the closer two dimensions are, the larger the magnitude of their covariance. Thus, in a dimension that is distant from the known dimensions, the variance is significantly higher and the predicted mean value is pulled closer to 0, as the last few dimensions in the following graph exhibit.
Graph 4: Altered Gaussian distribution given known values of certain indices
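The conditioning step behind Graph 4 can be written out for a two-dimensional Gaussian: fixing the value of the first dimension updates the mean and variance predicted for the second. A minimal sketch, with an assumed covariance matrix chosen for illustration:

import numpy as np

mu = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])  # strong positive covariance between the two dimensions

x1 = 2.0  # observed value of dimension 1

# Standard Gaussian conditioning formulas:
# mean_2|1 = mu_2 + cov_21 / cov_11 * (x1 - mu_1)
# var_2|1  = cov_22 - cov_21 * cov_12 / cov_11
mean_cond = mu[1] + cov[1, 0] / cov[0, 0] * (x1 - mu[0])
var_cond = cov[1, 1] - cov[1, 0] * cov[0, 1] / cov[0, 0]
print(mean_cond, var_cond)  # 1.8, 0.19: mean pulled toward the observation, variance reduced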
Furthermore, we can extend the model from the multi-dimensional Gaussian distribution to the Gaussian Process. That is to say, we express the data with a continuous function, no longer limited to integer indices. We can calculate the covariance between any two points, and there are infinitely many such pairs, resulting in an infinite covariance matrix.
The Gaussian Process thus has an infinite number of variables, determined by the mean function and the covariance function (kernel function).
We can then predict the value at non-integer points using a Gaussian Process. Given the known data, a Gaussian Process calculates the predicted mean value and variance at each point. The following graph is a typical result of a Gaussian Process: given the observation points, the dark blue line marks the predicted mean value at each point, and the light blue shaded area represents the possible distribution of points.
Graph 5: A sample of a Gaussian Process
The data points we obtain serve as anchors that fix the function at specific points. The following graph shows the distribution after two points have been observed.
Graph 6: The effects of observable data points on fixing GP [1]
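The effect in Graph 6 can be reproduced with a toy one-dimensional sketch (the two observation points below are hypothetical):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Two hypothetical observed points acting as anchors
X_obs = np.array([[1.0], [4.0]])
y_obs = np.array([0.5, -0.3])

gp = GaussianProcessRegressor(kernel=1.0 * RBF(length_scale=1.0))
gp.fit(X_obs, y_obs)

X_pred = np.linspace(0, 5, 100).reshape(-1, 1)
mean, std = gp.predict(X_pred, return_std=True)

plt.plot(X_pred, mean)                                               # predicted mean
plt.fill_between(X_pred.ravel(), mean - std, mean + std, alpha=0.2)  # +/- 1 std band
plt.scatter(X_obs, y_obs)                                            # observations pin the function
plt.show()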
Part 3: Data Processing and Assumptions
From the previous data (Table 1), we make the following modification to make the data more sensible and better meet the experimental needs:
Modification 1: Because data from different groups were obtained on different days, the background lightness varies. Thus, in order to mitigate the impact of the background lightness, we perform the following operations in each group:
Step 1: We calculate the mean fluorescence of the negative (blank) control groups and use it to estimate the background fluorescence.
Step 2: We then perform the following operation to estimate the relative fluorescence in each group:

$$F_{\text{adjusted}} = F_{\text{measured}} - \bar{F}_{\text{background}}$$
The background mean fluorescence is 136.498, and the standard error of the result is 8.522.
Table 2: The adjusted values of the results
At the same time, some assumptions are made throughout the modelling process, listed below with the necessary justifications.
Assumption 1: Each variable's impact on the total fluorescence is independent of the others.
Justification 1: No specific chemical reactions happen between components inside the system except the overall reaction that produces fluorescence. This indicates that each variable can be fit as one dimension in the Gaussian Process without significant interaction between variables.
Assumption 2: The background lightness is considered almost constant throughout the experimental process.
Justification 2: Although the experiments were done on different days, the lighting conditions were held the same throughout the whole process, because the only light source in the laboratory is artificial light of constant brightness. The image processing procedure was also kept constant with the help of the software ImageJ.
At the same time, the results reveal that the standard error of the data is only 8.522, which is not significant, supporting the view that the background lightness is almost constant.
Assumption 3: The impact of the buffer and the water on the final fluorescence is considered uniform as long as they are added. The only variables causing changes in fluorescence are the different concentrations of triggerRNA, dsDNA, igRNA, probe, and Cas12a protein.
Justification 3: The effect of the added buffer is merely to hold the pH value of the solution constant. If it is added, no matter how much is present, its effect is considered to be the same.
The effect of the added nuclease-free water is only to hold the total amount of liquid constant. Thus, it has no individual effect on the final fluorescence, since nuclease-free water does not take part in the reaction directly.
Assumption 4: Each component inside the system, including Cas12a protein, triggerRNA, dsDNA, igRNA, and probe, plays an irreplaceable role. In negative control groups in which one of the components is absent, no fluorescence can be produced.
Justification 4: As explained in the description part, each component makes a unique contribution to the system. Cas12a protein cleaves the dsDNA. igRNA serves as a switch to turn on the system. TriggerRNA serves as a "key" to initiate the system and its subsequent reactions. At the same time, fluorescence can be produced only when the probe is present.
Part 4: Algorithm Implementation
Note: All code is based on Python 3.7.9.
In Python, the GaussianProcessRegressor class from the sklearn.gaussian_process package performs regression with a Gaussian Process.
In regression, we have to predict a continuous quantity from an input. The system we predict then contains the following two parts:
a. A systematic variation which can be accurately predicted by a function
b. An unavoidable and unpredictable noise inside the system
Thus, the final result can be expressed by the following equation:

$$y = f(x) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2)$$
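As a minimal sketch of this decomposition (the systematic function and noise level below are assumed purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
f = np.sin(2 * np.pi * x)          # systematic variation: a deterministic function
eps = rng.normal(0, 0.1, x.shape)  # unavoidable Gaussian noise
y = f + eps                        # observed output: y = f(x) + epsilon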
In linear regression, the relationship between observation and output is constrained to be linear. In a Gaussian Process, by contrast, different combinations of kernels can be utilized to achieve more flexible fits. The following is a brief introduction to the common kernels we used in the GP (Gaussian Process).
A. RBF (Radial Basis Function) Kernel
It is expressed mathematically as

$$k(x_i, x_j) = \exp\left(-\frac{d(x_i, x_j)^2}{2\ell^2}\right)$$

where $\ell$ is the length scale of the kernel and $d(x_i, x_j)$ is the Euclidean distance; $\ell$ must always be greater than zero. The Euclidean distance is calculated using the following formula:

$$d(x_i, x_j) = \sqrt{\sum_{k}(x_{i,k} - x_{j,k})^2}$$
This kernel is infinitely differentiable and has a very smooth shape. In this kernel, the length scale $\ell$ controls how quickly the covariance changes with distance, and a constant $\sigma^2$ multiplied in front of $k$ controls the magnitude of the effect. [5]
The following graph shows the curve of a typical RBF kernel.
Graph 7: A typical curve of an RBF kernel
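A brief sketch of evaluating this kernel with scikit-learn (the constant factor in front plays the role of $\sigma^2$; the parameter values are arbitrary):

import numpy as np
from sklearn.gaussian_process.kernels import RBF

kernel = 2.0**2 * RBF(length_scale=1.0)  # sigma^2 * k(x_i, x_j)

X = np.array([[0.0], [0.5], [2.0]])
print(kernel(X))  # covariance matrix: entries decay smoothly as points move apart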
B. Matérn kernel
The Matérn kernel is a generalization of the RBF kernel. In addition to the RBF kernel's parameters, another variable $\nu$ controls the smoothness of the result:

$$k(x_i, x_j) = \frac{1}{\Gamma(\nu)\,2^{\nu-1}}\left(\frac{\sqrt{2\nu}}{\ell}\,d(x_i, x_j)\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}}{\ell}\,d(x_i, x_j)\right)$$

where $K_\nu(\cdot)$ is the modified Bessel function and $\Gamma(\cdot)$ is the gamma function. In Python, the Matérn kernel takes three possible inputs: length_scale, length_scale_bounds, and nu, where nu controls the smoothness of the whole curve. [5]
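A short sketch of how nu controls smoothness (the length scale here is an arbitrary choice):

import numpy as np
from sklearn.gaussian_process.kernels import Matern

X = np.array([[0.0], [0.5]])
for nu in [0.5, 1.5, 2.5]:
    k = Matern(length_scale=1.0, nu=nu)
    # Larger nu gives smoother functions; as nu grows the kernel approaches the RBF kernel
    print(nu, k(X)[0, 1])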
C. RationalQuadratic Kernel
The RationalQuadratic kernel is based on the RBF kernel, combining different length scales. The following equation shows this kernel:

$$k(x_i, x_j) = \left(1 + \frac{d(x_i, x_j)^2}{2\alpha\ell^2}\right)^{-\alpha}$$

The two possible inputs of this kernel are the length scale parameter $\ell$ and the scale-mixture parameter $\alpha$, with their respective bounds. It is noteworthy that both inputs are scalars. [5]
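A corresponding sketch for this kernel (parameter values chosen only for illustration):

import numpy as np
from sklearn.gaussian_process.kernels import RationalQuadratic

# alpha is the scale-mixture parameter; length_scale is the shared length scale
k = RationalQuadratic(length_scale=1.0, alpha=0.5)
X = np.array([[0.0], [1.0], [3.0]])
print(k(X))  # as alpha grows large, this kernel converges to the RBF kernel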
D. Other possible kernels
Other kernels exist, such as the ExpSineSquared (ESS) kernel, the DotProduct kernel, the WhiteKernel, and so on. Different combinations of kernels may be utilized to achieve satisfactory results.
After various tests, we found that a combination of the RBF, Matérn, and RationalQuadratic kernels fits our data best and produces the most reasonable and satisfying outcome. The resulting kernel parameters of the Gaussian Process are the following:
k1: 0.0293**2 * RationalQuadratic(alpha=5.63, length_scale=0.00551) + 17.9**2 * RBF(length_scale=[1e+05, 4.45, 0.0522, 0.0934, 0.671])
k2: 20.8**2 * Matern(length_scale=3.41e+04, nu=1.5)
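Under the assumption that k1 and k2 are summed into one composite kernel (as the fitting code in Appendix A suggests), the fitted parameters above correspond to the following scikit-learn expression:

from sklearn.gaussian_process.kernels import RBF, Matern, RationalQuadratic

k1 = 0.0293**2 * RationalQuadratic(alpha=5.63, length_scale=0.00551) + \
     17.9**2 * RBF(length_scale=[1e5, 4.45, 0.0522, 0.0934, 0.671])
k2 = 20.8**2 * Matern(length_scale=3.41e4, nu=1.5)
kernel = k1 + k2  # the composite kernel with the optimized hyperparameters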
The final results of the Gaussian Process are shown in the following graphs:
Note: The light blue area shows the predicted distribution of the fluorescence level as each variable changes, while the yellow points show the actual fluorescence produced by the system in the experiment.
Graph 8: The predicted impact of triggerRNA on final fluorescence
Graph 9: The predicted impact of dsDNA on final fluorescence
Graph 10: The predicted impact of igRNA on final fluorescence
Graph 11: The predicted impact of probe on final fluorescence
Graph 12: The predicted impact of Cas12a Protein on final fluorescence
Part 5: The interpretation and conclusions drawn from the final results
From the graphs above, we can draw a number of interpretations from the results:
For triggerRNA: triggerRNA does not seem to impact the final fluorescence significantly as long as it is added to the system. In the graph, a straight line parallel to the x-axis describes the relationship between triggerRNA and the fluorescence level, so the concentration of triggerRNA does not greatly impact the final result. To sum up, if triggerRNA is not added to the system, no fluorescence is produced; if it is added, fluorescence is produced at a level that is not affected by its concentration.
For dsDNA: In the graph, the relationship between dsDNA concentration and fluorescence produced is linear with a positive correlation. This shows that, within a certain range, the higher the concentration, the more significant the fluorescence, as Graph 9 shows.
For igRNA: From Graph 10, we see that igRNA does not impact the final fluorescence in a linear way. Without igRNA, no fluorescence is produced. When it is added in a small amount, the fluorescence produced increases greatly. However, once enough igRNA is present, the growth in fluorescence stagnates or is even curtailed slightly. This indicates that above a certain value, the concentration of igRNA is no longer the decisive factor in the production of fluorescence.
For probe: According to the modelling results, the effect of the probe on the final fluorescence is similar to that of igRNA. A slight change in probe concentration causes a dramatic change in final fluorescence at first; later, however, increasing the probe concentration has no effect or even an adverse effect.
For Cas12a protein: Cas12a protein is an indispensable part of the system. Without Cas12a protein, no fluorescence is seen. Increases in the concentration of Cas12a protein first cause an increase in the final result; however, when Cas12a protein is added in excess, the fluorescence level seems to decline as more is added.
Reflections on possible defects of the system:
First, the final predicted results of the Gaussian Process may not be completely correct. For example, in the outcome for triggerRNA (Graph 8), most points fall outside the predicted range. This might show that the final model cannot deliver a satisfactory outcome for all variables, which is surely a limitation of the model.
In order to overcome this limitation, we would need more data points for the concentration of triggerRNA to obtain a more reliable outcome.
Second, no repetition of experiments was made throughout the process.
Each concentration was tested only once, so some randomness remains in the experiment, making the final result less reliable. To overcome this drawback, it would be better to repeat each trial three times and average the results.
Third, in Graphs 8 and 9 (the predicted results for triggerRNA and dsDNA), the confidence interval does not pass through the origin. However, based on Assumption 4, in negative control groups where one of the components is absent, no fluorescence is produced, so every confidence interval is supposed to pass through (0, 0). In the $\pm 1\sigma$ interval (68.27%), neither Graph 8 (triggerRNA) nor Graph 9 (dsDNA) passes through the origin. In the $\pm 2\sigma$ confidence interval (95.45%), the Graph 9 group (dsDNA) covers the origin, but the Graph 8 group (triggerRNA) still does not pass through it.
More experiments would certainly have to be carried out to find out why the model behaves like this.
Appendix A: The code used in Gaussian Process
# All of the programming software is based on Python 3.7.9 edition
%pip install numpy pandas matplotlib scikit-learn rich

import pandas as pd
from rich import print

# Upload this file in advance
data = pd.read_csv("./clean_data.csv")

# Estimate the background fluorescence from the blank control rows
bg = data[data["note"] == "blank"]["fluorescence"].mean()
e_std = data[data["note"] == "blank"]["fluorescence"].std()
print(f"Fluorescence background mean: {bg:.3f}, method std error: {e_std:.3f}")

# Subtract the background to obtain the adjusted fluorescence
data["cal_fluorescence"] = data["fluorescence"] - bg
# data.loc[data["note"] == "blank", "cal_fluorescence"] = 0
print(data)

# Save the csv document
data.to_csv("data.csv")

import numpy as np
import matplotlib.pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, DotProduct, Matern, RationalQuadratic

# RBF + RationalQuadratic + Matern serves as the kernel
kernel = (
    1.0 * RationalQuadratic()
    + 1.0 * RBF(length_scale=np.ones(5), length_scale_bounds=(1e-2, 1e5))
    + 1.0 * Matern(length_scale=1.0)
)
# Other possible kernels; remember to multiply by 1.0 at the beginning
# to make sigma a tunable variable:
# 1.0 * RBF(length_scale=np.ones(5), length_scale_bounds=(1e-2, 1e5))
# 1.0 * Matern(length_scale=1.0)
# 1.0 * DotProduct(sigma_0=1.0)
# 1.0 * RationalQuadratic()
# 1.0 * ExpSineSquared()

# The first five columns serve as the independent variables
X_train = data.iloc[:, :5].to_numpy()
# The adjusted fluorescence serves as the dependent variable
Y_train = data["cal_fluorescence"].to_numpy()

# alpha = e_std**2 treats the background standard error as observation noise
gp = GaussianProcessRegressor(kernel=kernel, alpha=e_std**2, n_restarts_optimizer=99)
gp.fit(X_train, Y_train)

# Inspect the fitted kernel hyperparameters
print(data.columns[:5].to_list())
for k in gp.kernel_.__dict__:
    print(k, gp.kernel_.__dict__[k])

# Assess the results: predicted mean and standard deviation at the training points
mean, std = gp.predict(X_train, return_std=True)
data["Pred_mean"] = mean
data["Pred_std"] = std

# Plot one graph per variable group: predicted mean with a +/- 1 std band
keys = list(data["group"].unique())
for k in keys:
    plt.plot(
        data[data["group"] == k][k], data[data["group"] == k]["Pred_mean"], label=k
    )
    plt.fill_between(
        data[data["group"] == k][k],
        data[data["group"] == k]["Pred_mean"] - data[data["group"] == k]["Pred_std"],
        data[data["group"] == k]["Pred_mean"] + data[data["group"] == k]["Pred_std"],
        alpha=0.2,
    )
    plt.xlabel(k)
    plt.ylabel("Predicted mean fluorescence")
    plt.legend()
    plt.tight_layout()
    plt.show()
Appendix B: Reference Page
[1] Daniel McDuff, "Gaussian Process", MIT Media Lab, https://courses.media.mit.edu/2010fall/mas622j/ProblemSets/slidesGP.pdf
[2] Robotics Knowledgebase, "Gaussian Process and Gaussian Mixture Model", https://roboticsknowledgebase.com/wiki/math/gaussian-process-gaussian-mixture-model/
[3] Carl Edward Rasmussen, Christopher K. I. Williams (2006). "Gaussian Processes for Machine Learning". The MIT Press.
[4] David Duvenaud (2014). "The Kernel Cookbook: Advice on Covariance Functions".
[5] Pedregosa et al. (2011). "Scikit-learn: Machine Learning in Python". JMLR 12, pp. 2825-2830.