Model | ICT-Mumbai

Gene expresion is a complex process involving a plethora of variables. Thus, to develop a simple model that can give a prediction of protein abundance based on fewer variable is important to avoid complexity due to high number of variables and also computationally feasible. The purpose of our model is to predict the protein abundance based on the 5’-UTR sequence alone i.e it is our only variable. This model takes in the sequence as the input and the output of the model is a dimensionless quantity relating protein abundance obtained from (Fluorescence) and the cell population (Fluorescence/OD). Thus, we have developed an Artificial Neural Network based approach to predict the abundances using the 5’UTR sequence.

Since the model depends solely on the UTR sequence, it is a deterministic model. It doesn’t account for the stochasticity associated with gene regulation, it assumes it to be a deterministic event. To include stochasticity, we will need to consider the spatio-temporal effects of the expression which is beyond the scope of this project.

Owing to its versatility, efficiency, compatibility and many other advantages, we are using Python, which is a powerful Open-Source programming language, for developing and implementing our model. The main script has been written using Google Colab, however, it can be run by any IDE (Integrated Develepment Environment) that can run python scripts. It can be run locally (Jupyter, Spyder etc.) as well as on Google Colab (preffered since it will have all the needed packages installed).

The requirements of the model are minimal, the only experimental parameter required is the 5’-UTR sequence. Libraries required for the model are: Numpy, Pandas, Scikit Learn and Tensorflow, all of which are open source libraries of Python and can be installed with very basic commands.

The data required for training the model is very specific in nature. It has been verified through literature survey that such kind of input is an appropriate input for such a model.The dependance of the gene expression on the 5’-UTR has been studied in the past and computational models have been developed to study this relation ([1]). This relationship is well etablished, our work aims at quantifying this relation in order to engineer more efficient gene expression systems.

The data which we are using to train our model comes from a study of gene regulation in Yeast [1]. The experiment measured the YFP fluorescence levels by varying the untranslated region (10 bp) upstream of start codon. We extracted the necessary data from the published data into our model using the Pandas library of python. Below shown is the nature of data that is required to run our model:

There have been attempts made in the past to correlate the protein abundance and the 5’-UTR sequence ([1],[2]) using different techniques. Our model differs with them mainly based on the experiments conducted and the data available to correlate the two.

Our model should predict the best possible 5’UTR sequence and we will design the UTR based on the sequence predicted by the model. Then compare the protein abundance with other samples using different UTR sequences.

Below shown is a schematic of how our model works:

The hidden layers have the following features:

These numbers were obtained based on a rough estimation and trial and error method. While training the model, we can look for how good the model performs, for example, below shown is the training data of the model:

The above image shows how well our model trains, how the error between the prediction and real values decreases over the number of runs. It eventually reaches a point where the error saturates, we can stop training at that point to avoid excess computation time. The output of the model is just a number (protein abundance), which helps us compare different sequences. But, in order to check how close our predictions are to the real values, we can run our model and get a pareto plot to see how the predictions work. Below is an example of a pareto plot:

The above plot shows how accurate the predictions are.

This model can be expanded by including more variables but that will require a more complex network architecture and in such a case, a more mechanistic model can be better than a data driven model since obtaining data for so many variables over huge number of data points is just not a feasible operation.

We tried to evaluate our model using the experimental values obtained by the Wet Lab team in their second engineering cycle. Many UTR sequences were tested in the Wet Lab and the best performing UTR sequence at the shortest length of 9 nucleotides cspF was chosen and different point mutations were made by introducing random single point mutations to generate 10 sequences. Fluorescence observations were made for every single one of them and those values are being used by us now to evaluate our model.

The sequences we chose are:
['GGGGATTTTT', 'GGGAGTTTTT', 'GGGAATATTT',
'GGGAATTTTA', 'GGGAATTTTC', 'GGTAATTTTT',
'GGGAAGTTTT', 'GGGAATTTCT', 'GGGACTTTTT', 'GGGAACTTTT']

We evaluated the protein abundances predicted by our model based on one of the above mentioned sequences and we obtained the following values:

Sequence	Model Prediction	Experimental Values
GGGGATTTTT	9.028239	25323.69
GGGAGTTTTT	8.607087	26289.5
GGGAATATTT	8.84179	23897.92
GGGAATTTTA	10.54604	21233.33
GGGAATTTTC	9.16265	25022.68
GGTAATTTTT	8.884349	24731.95
GGGAAGTTTT	8.956186	25799.31
GGGAATTTCT	8.40715	28247.22
GGGACTTTTT	8.6906	25323.69
GGGAACTTTT	8.894659	23812.6

The model has been trained on a different dataset and thus, very obviously the scaling factor of the model is different. The Significance of our model is to show that such an approach using neural networks can help us predict the protein abundances (F/OD600). It can help us accurately predict the gene expression efficiencies.

We show that our model does have some kind of relationship with the experimental data, for that, we use the Pearson’s Correlation Coefficient which is given by:

Where x and y are the values from two datasets we are comparing.

Using python, we have calculated the Pearsons Correlation as:


from scipy import stats

import numpy as np

 
model_predicted_values = np.array([ 9.028239,  8.607087,  8.84179 ,     10.54604 ,  9.16265 ,  8.884349, 8.956186,  8.40715 ,  8.6906  ,  8.894659])

experimental_values = np.array([25323.69, 26289.5, 23897.92, 21233.33, 25022.68, 24731.95, 25799.31,28247.22, 25323.69, 23812.6]) 

p_corr = stats.pearsonr(model_predicted_values ,experimental_values)

This gives us a pearson’s correlation coefficient as -0.84. The pearsons correlation coefficient can go from +1 to -1, indicating two extreme correlations, positive and negative. Our model is strongly correlated with the experimental data at a statistically significant p value of 0.002. The model can very well predict the trend followed by the UTR sequences, which can be a helpful tool to compare given set of sequences and Qualitatively judge them based on just their sequences. This shows that the model we have developed, and works beautifully!

We shall be using our own model in the third engineering cycle design!!!

Following is the Google Colab Script:

Following is the html script of the Google Colab File:

References:
[1] Dvir, S., Velten, L., Sharon, E., Zeevi, D., Carey, L. B., Weinberger, A., & Segal, E. (2013). Deciphering the rules by which 5′-UTR sequences affect protein expression in yeast. Proceedings of the National Academy of Sciences, 110(30), E2792-E2801.
[2] Brown, R. H., Gross, S. S., & Brent, M. R. (2005). Begin at the beginning: Predicting genes with 5′ UTRs. Genome research, 15(5), 742-747.