Model

Background

This project aims to display chitosanase on the surface of E. coli cells to hydrolyze chitosan. The more chitosanase is present on the cell surface, the more efficiently chitosan is hydrolyzed. The chitosanase gene on the plasmid we constructed must come from another species, and the chitosanase is expressed as a fusion protein with InaK-N. We therefore want a heterologous chitosanase whose fusion protein is expressed at a high level in E. coli, which makes the source species of the chitosanase gene an important choice.

Purpose

Use modeling to select a chitosanase that is suitable for expression by cell surface display in E. coli.

Method

Assumption

Figure 1. Genetic central dogma

Model Making

Several studies have reported the relationship between mRNA sequences and the corresponding protein yield. We use a machine learning model to map sequence information to target protein yield (Figure 2). Using sequence features related to stability and translation efficiency (minimum free energy, codon adaptation index (CAI), etc.), a random forest regression model was trained on the paired sequences and protein yields (Figure 3).
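One of the features named above, the codon adaptation index, can be illustrated with a short Python sketch. This is not the project's code: the function names, the pseudocount of 0.5 for unobserved codons, and the toy reference sequence are all assumptions made for illustration. In practice the reference set would be a collection of highly expressed E. coli genes.

```python
import math
from collections import Counter

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))
# Met and Trp have a single codon and stops are not scored, so all are skipped.
SKIP = {"M", "W", "*"}


def split_codons(seq):
    """Split a coding sequence into complete codons (RNA 'U' is read as 'T')."""
    seq = seq.upper().replace("U", "T")
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_seqs, pseudocount=0.5):
    """Weight of each codon = its count / count of the most used synonymous codon.

    Assumption: codons never seen in the reference set get `pseudocount`
    instead of zero so that the logarithm in cai() stays defined.
    """
    counts = Counter(c for s in reference_seqs for c in split_codons(s))
    weights = {}
    for aa in set(AMINO) - SKIP:
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        fam_counts = {c: counts[c] or pseudocount for c in family}
        top = max(fam_counts.values())
        for codon, n in fam_counts.items():
            weights[codon] = n / top
    return weights


def cai(seq, weights):
    """Geometric mean of the codon weights over all scorable codons."""
    logs = [math.log(weights[c]) for c in split_codons(seq) if c in weights]
    if not logs:
        raise ValueError("sequence contains no scorable codons")
    return math.exp(sum(logs) / len(logs))


# Toy reference (hypothetical): leucine is encoded by CTG three times, CTA once.
toy_weights = relative_adaptiveness(["CTGCTGCTGCTA"])
print(cai("ATGCTGCTA", toy_weights))
```

A value like this, together with the minimum free energy from an RNA folding tool, would form one row of the feature matrix passed to the regression model.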

Figure 2. Establishment and application of protein yield prediction model

Figure 3. Partial code for modeling

Model evaluation results

Mean Absolute Error: 6.004875; Root Mean Squared Error: 7.646747. The Root Mean Squared Error deviates from experiment by less than 2%, which we take as verification of the model's accuracy.
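For reference, the two metrics reported above are defined as follows. This is a minimal sketch with hypothetical function names and made-up toy values, not the project's evaluation code or data.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average of |measured - predicted| over all samples."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared error; penalizes large misses more."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


# Toy values (hypothetical), only to show the call pattern.
measured = [1.0, 2.0, 3.0]
predicted = [2.0, 2.0, 5.0]
print(mean_absolute_error(measured, predicted))
print(root_mean_squared_error(measured, predicted))
```

Both metrics are in the same units as the predicted yield, and the Root Mean Squared Error is never smaller than the Mean Absolute Error for the same predictions.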

Result

Many species are known to carry chitosanase genes, including Bacillus amyloliquefaciens, Linderina pennispora, and Bacillus thuringiensis. We used the model to predict the yield of the corresponding fusion proteins:

Species                       Predicted expression
Bacillus amyloliquefaciens    93.102
Bacillus sonorensis           94.018
Bacillus halotolerans         93.992
Streptomyces olivaceus        94.524
Linderina pennispora          94.402
Bacillus thuringiensis        97.55

The predicted yield of the fusion protein differed between chitosanases from different species, and the chitosanase from Bacillus thuringiensis gave the highest predicted yield. Therefore, the chitosanase gene sequence from Bacillus thuringiensis was considered for this project.
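The selection step amounts to ranking the candidates by their predicted values. A minimal sketch using the numbers reported in the table; the variable names are ours:

```python
# Predicted expression values as reported in the results table.
predicted = {
    "Bacillus amyloliquefaciens": 93.102,
    "Bacillus sonorensis": 94.018,
    "Bacillus halotolerans": 93.992,
    "Streptomyces olivaceus": 94.524,
    "Linderina pennispora": 94.402,
    "Bacillus thuringiensis": 97.55,
}

# Sort candidates from highest to lowest predicted expression.
ranking = sorted(predicted.items(), key=lambda item: item[1], reverse=True)
best_species, best_value = ranking[0]

for species, value in ranking:
    print(f"{species:<28}{value:>8.3f}")
```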

Discussion

In this model, a machine learning method is used to predict the yield of the target protein in order to assist in selecting a suitable chitosanase gene sequence. The model predicts that the chitosanase gene from Bacillus thuringiensis forms the InaK-Chitosanase fusion protein with a high yield, which is expected to hydrolyze chitosan more efficiently. A limitation is that InaK-Chitosanase must be displayed on the cell surface to hydrolyze chitosan, and the process by which the fusion protein is presented on the cell surface is not considered in this model. Nevertheless, the results can still serve as a reference for selecting the chitosanase gene.

References

Combinatorial optimization of mRNA structure, stability, and translation for RNA-based therapeutics
Insights into promiscuous chitosanases: the known and the unknown
Surface Immobilization of Human Arginase-1 with an Engineered Ice Nucleation Protein Display System in E. coli