Team:OUC-China

DISP

Promoter learning

As a new type of fermentation chassis organism, Aureobasidium melanogenum P16 has not yet been well developed, so our team used machine learning to design promoters in silico for future teams and other researchers. In addition, as the chassis is developed further, models that can predict promoter strength become essential.
First, we obtained transcriptome data for the Aureobasidium melanogenum P16 genome and took the 200 bp immediately upstream of each gene as its promoter sequence. Designing a usable promoter in silico requires two modules: one generates the promoter sequence, and the other predicts the strength of that promoter.
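The extraction of the 200 bp upstream region can be sketched as follows. This is a minimal illustration, not our actual pipeline: the function names, the 0-based half-open gene coordinates, and the handling of minus-strand genes by reverse complement are all assumptions made for the example.

```python
# Hypothetical helpers: names and coordinate conventions are assumptions.
# Gene coordinates are 0-based, half-open [start, end) on the forward strand.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def upstream_region(genome: str, start: int, end: int,
                    strand: str, length: int = 200) -> str:
    """Return up to `length` bases immediately upstream of a gene.

    For a '+' gene the upstream region lies to the left of `start`.
    For a '-' gene it lies to the right of `end` and is returned as the
    reverse complement, so it reads 5'->3' towards the gene.
    The region is truncated at the ends of the sequence.
    """
    if strand == "+":
        left = max(0, start - length)
        return genome[left:start]
    right = min(len(genome), end + length)
    return reverse_complement(genome[end:right])


if __name__ == "__main__":
    toy_genome = "ACGTTGCAAT"
    print(upstream_region(toy_genome, 4, 6, "+", length=3))
    print(upstream_region(toy_genome, 4, 6, "-", length=3))
```

Genes close to the start or end of a contig yield fewer than 200 bp; whether to pad or discard such promoters is a separate design decision.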


First, each promoter sequence was encoded with sequential (integer) coding. Following the approach used in natural language processing, the first n bases serve as the features and base n+1 serves as the label for training. The encoded sequences are then passed through an Embedding layer, which gives each base a richer feature representation, and an LSTM layer is used to train the model.

Fig. 1. Pre-processing of DNA sequences
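The pre-processing step, in which n bases are the features and the next base is the label, can be sketched as below. The integer assigned to each base and the function names are assumptions for illustration. The Embedding and LSTM layers that consume these windows are not shown.

```python
# Assumed integer code for each base; the actual mapping may differ.
BASE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(seq):
    """Map a DNA string to a list of integer codes."""
    return [BASE_TO_INDEX[base] for base in seq.upper()]


def make_windows(seq, n):
    """Slide a window over the sequence.

    Each sample uses n consecutive bases as features and the base that
    follows them as the label.
    """
    codes = encode(seq)
    features, labels = [], []
    for i in range(len(codes) - n):
        features.append(codes[i:i + n])
        labels.append(codes[i + n])
    return features, labels


if __name__ == "__main__":
    X, y = make_windows("ACGTA", n=3)
    print(X)  # [[0, 1, 2], [1, 2, 3]]
    print(y)  # [3, 0]
```

A sequence of length L yields L - n training samples, so a 200 bp promoter with n = 50 would contribute 150 samples.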

We held out 100 promoter sequences that were not used in training as a test set to verify the performance of the model. Because predicting the next base is a multi-class classification problem, we analyzed the results with a confusion matrix (Fig. 2).

The confusion matrix (Fig. 2) shows how many bases, and which types of bases, were predicted incorrectly when 100 bp were predicted.

Fig. 2. Confusion matrix of the model
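For readers unfamiliar with the analysis, a four-class confusion matrix for base prediction can be computed as in the sketch below. The row and column order (A, C, G, T) and the function names are assumptions; the example data are made up and are not our results.

```python
BASES = "ACGT"  # assumed row/column order of the matrix


def confusion_matrix(true_bases, predicted_bases):
    """Count predictions: rows are true bases, columns are predicted bases."""
    index = {base: i for i, base in enumerate(BASES)}
    matrix = [[0] * len(BASES) for _ in BASES]
    for true, predicted in zip(true_bases, predicted_bases):
        matrix[index[true]][index[predicted]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of predictions on the diagonal (correct base)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Toy data for illustration only.
    m = confusion_matrix("AACGT", "ACCGG")
    for base, row in zip(BASES, m):
        print(base, row)
    print("accuracy:", accuracy(m))
```

Off-diagonal cells show which base was mistaken for which, which is the information read from Fig. 2.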

Predict the expression

The transcriptome results for Aureobasidium melanogenum P16 report the expression level of each gene at five time points (6 h, 18 h, 36 h, 60 h, 96 h).
In our first attempt, the sequence was used as the feature and the five expression values were used together as a label vector, and CNN and LSTM neural networks were trained on them [1,2]. Because of the large amount of data and the depth of the network, training was slow and the model performed very poorly. We were not able to solve this problem with that architecture.

Learning from the previous model, we built five separate models, one for each time point, to predict expression strength. As before, the sequence is the feature and the expression strength is the label, but this time a random forest model is used for prediction.
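The data layout for the five per-time-point models can be sketched as follows. One-hot encoding of the sequence is an assumption made for this example, as are the function names; the text above only states that the sequence is the feature. The random forest itself is not shown.

```python
TIME_POINTS = ("6h", "18h", "36h", "60h", "96h")

# Assumed encoding: one-hot, flattened to a single feature vector.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def one_hot_flat(seq):
    """Flatten a DNA string into a 4 * len(seq) feature vector."""
    features = []
    for base in seq.upper():
        features.extend(ONE_HOT[base])
    return features


def split_targets(expression_rows):
    """Split per-gene rows of five expression values into one target
    list per time point, so each time point gets its own model."""
    return {
        time: [row[i] for row in expression_rows]
        for i, time in enumerate(TIME_POINTS)
    }


if __name__ == "__main__":
    X = [one_hot_flat(s) for s in ("ACGT", "TTGA")]
    # Toy expression values for illustration only.
    targets = split_targets([[1.0, 2.0, 3.0, 4.0, 5.0],
                             [6.0, 7.0, 8.0, 9.0, 10.0]])
    for time in TIME_POINTS:
        print(time, targets[time])
```

Each of the five target lists would then be paired with the same feature matrix and fitted by its own model, for example one scikit-learn `RandomForestRegressor` per time point.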

We again picked 100 sequences that were not used in training to verify the performance of the model (Fig. 3A). Because gene expression in living organisms is not very stable, we set an error interval of 20%, and every prediction that falls within this interval is judged to be correct. The two green lines mark the upper and lower bounds of the error interval, and the red line marks a perfect prediction. The results on the test set are clearly not ideal, which remained a problem.
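The 20% error-interval criterion can be expressed as a short scoring function. This sketch assumes the interval is defined relative to the measured value (a prediction is correct if it lies within ±20% of the measurement); the function names and example numbers are illustrative only.

```python
def within_tolerance(predicted, measured, tolerance=0.20):
    """True if the prediction lies within +/- tolerance of the measured
    value (assumed here to be a relative interval)."""
    return abs(predicted - measured) <= tolerance * abs(measured)


def tolerance_accuracy(predictions, measurements, tolerance=0.20):
    """Fraction of predictions that fall inside the error interval."""
    if not measurements:
        raise ValueError("no measurements given")
    hits = sum(
        within_tolerance(p, m, tolerance)
        for p, m in zip(predictions, measurements)
    )
    return hits / len(measurements)


if __name__ == "__main__":
    # Toy values for illustration only.
    predicted = [90.0, 130.0, 50.0]
    measured = [100.0, 100.0, 50.0]
    print(tolerance_accuracy(predicted, measured))
```

In the plot, the green lines correspond to `measured * (1 ± tolerance)` and the red line to `predicted == measured`.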

Fig. 3A. Predictions on the test set by the model without physical and chemical properties.
Fig. 3B. Predictions on the test set by the model with physical and chemical properties.

After discussing this point with the DUT-China team, we concluded that a promoter sequence cannot simply be regarded as a permutation of the letters 'ATCG'; it is, in essence, a chemical substance. We therefore appended the physical and chemical properties of each promoter sequence to its features. The test set is the same, and Fig. 3B shows the result of the new model.
Although the error is still large, the predictions clearly improved: the newly predicted data lie closer to the red line.
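Appending sequence-derived properties to the feature vector can be sketched as below. The two properties used here, GC content and purine fraction, are stand-ins chosen for the example; they are not necessarily the physical and chemical properties used in our model, and the function names are hypothetical.

```python
def gc_content(seq):
    """Fraction of G and C bases, which relates to duplex stability."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def purine_fraction(seq):
    """Fraction of purine bases (A and G)."""
    seq = seq.upper()
    return (seq.count("A") + seq.count("G")) / len(seq)


def augment(sequence_features, seq):
    """Append the illustrative property values after the sequence features."""
    return list(sequence_features) + [gc_content(seq), purine_fraction(seq)]


if __name__ == "__main__":
    seq = "AAGT"
    encoded = [0, 0, 2, 3]  # any per-base encoding of `seq`
    print(augment(encoded, seq))
```

The augmented vector keeps the original sequence features unchanged, so the same model type can be retrained on it for a direct comparison.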

Fig. 4. The whole process of the model

Fig. 4 is the flowchart of the whole process of designing promoters in silico. Each time a promoter's strength is predicted, the true transcriptional intensity is measured in wet-lab experiments. Whether or not the prediction is correct, the measured result must be fed back into the model to enlarge its data set and improve its capabilities.

Appeal

In this project we did not validate the in silico promoter design model with wet-lab experiments, in part because predicting promoter strength is still a major problem: the accuracy is very low, and the model cannot predict promoters with high expression strength. We hope that in the future more teams will continue our work and add more components to this new fermentation chassis organism.


A project by the OUC-China & Research iGEM 2022 team.

Contact
OUCiGEM@163.com