
To ensure the success of our project, we strictly followed the standard engineering cycle: Design → Build → Test → Learn → Design...

We completed at least three iterations of the engineering cycle, which are summarized in the Test and Learn sections.

Research

We began the project by extensively searching the literature, and we also consulted doctors and professors in related fields.


Medical needs

Acarbose is an effective drug for the treatment of type Ⅱ diabetes, and the already large demand for it is increasing year by year. We therefore need a cell engineering strategy that increases acarbose production by actinomycetes.



Experimental needs

At the same time, we distributed a questionnaire to researchers in the Enzyme Engineering and Genetic Engineering Lab to find out which problems they needed solved.

Design

When building our model, we considered the following requirements for the design of the interactive platform.



Pre-experiment Data Gathering

Metabolism is the superposition of many chemical reactions. There are multiple pathways in Actinoplanes sp. SE50/110 to produce and consume acarbose. For wet-lab researchers, it takes up a great deal of time to obtain a comprehensive understanding of the metabolic pathways.

Key enzymes positioning

The metabolism of acarbose involves dozens of chemical reactions and a large number of enzymes. Every reaction flux (metabolic rate) is subject to a basic constraint: it cannot exceed the maximum rate of the reaction (vmax), which equals the intracellular concentration of the corresponding enzyme multiplied by the enzyme's turnover number (kcat). It is therefore important to find the key enzymes that exert the greatest influence on the pathway.
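As a minimal sketch of this constraint, the snippet below computes vmax from a kcat value and an enzyme abundance and checks a flux against it. The function names, the units (kcat in 1/s, enzyme abundance in mmol/gDW, flux in mmol/gDW/h) and the numbers are illustrative assumptions, not values from our model.

```python
SECONDS_PER_HOUR = 3600.0

def v_max(kcat_per_s, enzyme_mmol_per_gdw):
    """Maximum reaction rate in mmol/gDW/h: vmax = kcat * [E]."""
    return kcat_per_s * SECONDS_PER_HOUR * enzyme_mmol_per_gdw

def is_feasible(flux, kcat_per_s, enzyme_mmol_per_gdw):
    """A flux is allowed only if it does not exceed vmax."""
    return flux <= v_max(kcat_per_s, enzyme_mmol_per_gdw)

# Hypothetical numbers: kcat = 10 1/s, enzyme abundance = 1e-5 mmol/gDW,
# which gives a vmax of about 0.36 mmol/gDW/h.
limit = v_max(10.0, 1e-5)
print(is_feasible(0.30, 10.0, 1e-5))  # flux below the limit
print(is_feasible(0.50, 10.0, 1e-5))  # flux above the limit
```

Raising either kcat or the enzyme abundance raises the limit, which is why both the enzyme and its kinetic parameter matter when choosing a modification target.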

Modification sites positioning

Most enzymes are proteins, so they have complex higher-order structures. Substrate molecules also have a spatial structure, and enzymes and substrates react in three dimensions. Altering the active site can therefore have a significant effect on the catalytic function of an enzyme. Finding such alterations, however, requires a large number of pilot experiments, each of which takes a long time and gives results that are easily affected by the experimental procedure.


When designing our software platform, we chose the following tools as technical support.

Back end


We used PyTorch as the deep learning framework, with NumPy and RDKit for auxiliary operations. We also used MySQL and MariaDB as database analysis tools.

Front end


We used HTML, CSS and JavaScript, together with the Bootstrap framework, to build the page structure, handle page rendering and complete related work.


Build

To help biological researchers design better experimental protocols, we built a universal model framework. Our software provides two practical and powerful models that help researchers locate key enzymes and obtain modification recommendations.

Details of the modelling principles and process can be found on the modelling page.

1. From GEM to ecGEM

A genome-scale metabolic model (GEM) of Actinoplanes sp. SE50/110, iYLW1028, has already been developed. To analyze acarbose production quantitatively while incorporating information on the catalytic functions of different enzymes, we applied the GECKO [2] method (GEM with Enzymatic Constraints using Kinetic and Omics data) to add enzyme constraints to the GEM. This narrows the space of feasible flux distributions and improves the predictions.

In this way, we impose an enzyme constraint on each reaction flux and extend the actinomycete GEM to an ecGEM.
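In simplified form, an enzyme-constrained model of this kind can be written as the linear program below. The symbols are shorthand chosen for this sketch: S is the stoichiometric matrix, v the flux vector, c selects the objective (here acarbose production), e_j and MW_j are the abundance and molecular weight of the enzyme catalysing reaction j, and P_total is the protein mass available to metabolic enzymes. The full GECKO formulation handles further details, such as isozymes, reversible reactions and enzyme saturation, that are omitted here.

```latex
\begin{aligned}
\max_{v,\,e}\quad & c^{\top} v \\
\text{s.t.}\quad & S v = 0, \\
& 0 \le v_j \le k_{\mathrm{cat},j}\, e_j \quad \text{for every enzyme-catalysed reaction } j, \\
& \sum_j MW_j\, e_j \le P_{\mathrm{total}}, \qquad e_j \ge 0 .
\end{aligned}
```

Compared with a plain GEM, the only additions are the per-reaction bound involving kcat and the shared protein budget, which is why missing kcat values directly limit what the ecGEM can predict.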

2. From ecGEM to DL-ecGEM

When filling in the kcat values for the actinomycete ecGEM, we found that a large number of them are missing from the databases. To predict in vitro kcat values, we developed a deep learning method that combines a graph neural network (GNN) for substrates with a convolutional neural network (CNN) for proteins. After obtaining the vector representation of the substrate molecule and of the protein sequence, we concatenate them and train the model against the corresponding kcat value. This allows us to use the substrate structure, substrate name and protein sequence as inputs to predict the corresponding kcat value.

In this way, we upgraded the ecGEM to a DL-ecGEM to make the model simulations more accurate.

3. Locating Key Enzymes

With a more complete metabolic model, we can observe the metabolic fluxes of the cell through simulation. To assess how strongly each enzyme in the ecGEM influences the target reaction, we use the flux control coefficient (FCC), which measures how much each enzyme limits the system [6]. It is defined as the ratio of the relative change in the target flux to a 0.1% relative change in the kcat value of the corresponding enzyme:

$$\mathrm{FCC}_i = \frac{\Delta v_{\mathrm{target}} / v_{\mathrm{target}}}{\Delta k_{\mathrm{cat},i} / k_{\mathrm{cat},i}}, \qquad \Delta k_{\mathrm{cat},i} = 0.001\, k_{\mathrm{cat},i}$$

By calculating the FCC values of each enzyme in the simulation, we can find the key enzyme classes in the acarbose metabolic pathway to be targeted for modification.
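The FCC calculation can be sketched as a finite-difference loop over the enzymes. In the snippet below, `target_flux` is a toy stand-in for a full ecGEM simulation (a linear pathway whose flux is capped by substrate uptake and by each step's vmax); the enzyme names and all numbers are hypothetical.

```python
def target_flux(kcats, enzyme_levels, uptake_limit):
    """Toy stand-in for an ecGEM simulation: a linear pathway whose flux
    is capped by substrate uptake and by each step's vmax = kcat * [E]."""
    caps = [kcats[name] * enzyme_levels[name] for name in kcats]
    return min([uptake_limit] + caps)

def flux_control_coefficients(kcats, enzyme_levels, uptake_limit, rel_step=0.001):
    """FCC_i = (relative change in target flux) / (relative change in kcat_i)."""
    base = target_flux(kcats, enzyme_levels, uptake_limit)
    fcc = {}
    for name in kcats:
        perturbed = dict(kcats)
        perturbed[name] = kcats[name] * (1.0 + rel_step)  # +0.1% by default
        new = target_flux(perturbed, enzyme_levels, uptake_limit)
        fcc[name] = ((new - base) / base) / rel_step
    return fcc

# Hypothetical two-enzyme pathway in which E1 is the bottleneck.
kcats = {"E1": 5.0, "E2": 50.0}
levels = {"E1": 1.0, "E2": 1.0}
fcc = flux_control_coefficients(kcats, levels, uptake_limit=10.0)
key_enzyme = max(fcc, key=fcc.get)
print(key_enzyme, fcc)
```

An FCC close to 1 means the target flux scales almost proportionally with that enzyme's kcat, while an FCC of 0 means the enzyme is not limiting under the simulated conditions.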

4. Giving protein modification solutions


After finding the key enzyme, our model applies the HotSpot Wizard 2.0 toolkit, which lets users identify and select mutational hotspots that affect protein stability and catalytic activity. Users can choose suitable substitutions for individual amino acid sites based on the amino acids predicted to be tolerated in highly similar homologous proteins or on amino acid distributions.


When a single enzyme yields multiple modification targets, we re-predict the kcat values of all possible mutant proteins and report the modification site with the greatest change in enzyme activity, together with the corresponding kcat value.
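The ranking step can be sketched as follows. The function `rank_mutants`, the candidate format and the example sequence are assumptions made for illustration, and `toy_predictor` is a placeholder that only stands in for the trained deep learning model; it has no biological meaning.

```python
def rank_mutants(wild_type_seq, candidates, predict_kcat):
    """Return (mutation, predicted kcat, fold change), largest change first.

    candidates: list of (position, new_residue) with 1-based positions.
    predict_kcat: callable that maps a protein sequence to a kcat value.
    """
    wt_kcat = predict_kcat(wild_type_seq)
    results = []
    for position, new_residue in candidates:
        old_residue = wild_type_seq[position - 1]
        mutant = wild_type_seq[:position - 1] + new_residue + wild_type_seq[position:]
        kcat = predict_kcat(mutant)
        label = f"{old_residue}{position}{new_residue}"
        results.append((label, kcat, kcat / wt_kcat))
    # Sort by the absolute change in predicted kcat relative to the wild type.
    return sorted(results, key=lambda r: abs(r[1] - wt_kcat), reverse=True)

# Placeholder predictor for demonstration only: NOT a real kcat model.
def toy_predictor(seq):
    return 1.0 + seq.count("W") * 2.0 + seq.count("G") * 0.5

ranking = rank_mutants("MKTAYG", [(2, "W"), (4, "G"), (5, "A")], toy_predictor)
best_mutation, best_kcat, fold_change = ranking[0]
print(best_mutation, best_kcat)
```

In practice the candidate list would come from the hotspot tool and the predictor would be the trained model, so only the ranking logic carries over from this sketch.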

Closed-loop Structure

In this way, our software forms a self-validating and self-optimising closed loop. We use a deep learning network to supply the parameters missing from the ecGEM and so build a reliable and complete metabolic network model for actinomycetes. With this model, we run simulations with acarbose production as the objective function to find the enzyme that limits the target reaction. We set this key enzyme as the modification target and combine it with a protein hotspot identification tool to find modification solutions. We then feed the modification scheme back into the deep learning network and the metabolic network to check whether acarbose production improves significantly, thereby verifying the feasibility of the recommended scheme.

In the process of optimising the model parameters, we can use experimental data to bring the model performance closer to reality.

Test

Dry-Lab Validation

We used the test set, which contains 17,010 entries, to measure the predictive ability of our model.

We produced at least three updated versions of our DL-ecGEM model during our engineering cycle. The results are presented below:

Model version 1

Model version 2

Model version 3

The results show that the predictive ability and robustness of our model improved from version to version, which supports the effectiveness of our approach. For more details, see Proof of Concept.

Wet-Lab Validation

Because wet-lab work was limited by the epidemic, we used our optimized DL-ecGEM for flux analysis to simulate the wet-lab experiment, and obtained satisfactory results.

For more details, see Proof of Concept.

User Feedback

Once our modeling platform was initially built, we contacted senior students in our cooperating lab to get feedback on the trial.

The senior students reported that, for researchers studying actinomycetes, the platform makes the overall picture of cellular metabolism and growth more intuitive and everyday use of the basic functions easier and faster, and that its suggested changes to modified proteins can guide wet-lab experiments.

However, when they tried the modification recommendation feature, they found that the software could evaluate the effect on kcat of only a single point mutation at a time, whereas experiments often mutate amino acids at multiple sites. They asked whether this could be improved.

Learn


After the dry-lab, wet-lab and user tests, we learned several things that helped us improve our software.


The improvement of data

When we first trained our deep learning model, we used a total of 401 data entries from actinomycetes to predict kcat values. Because the training set was so small, the loss first increased and then decreased as the number of epochs grew, and the overall predictions were unsatisfactory. After reading the relevant literature and consulting experts, we expanded the training set to a total of 17,010 entries covering all species. Performance improved significantly.

The improvement of deep learning model




Protein Modeling

At first, we tried to train the model by protein similarity, using sequence alignment to group the training data. However, building an evolutionary tree was too computationally expensive, and the resulting alignment scores were low. We then grouped the training data by protein length, but the results were still unsatisfactory.

In the end, we chose a CNN to model proteins.





Substrate modeling

We also experimented with molecular fingerprints. We tried MACCS keys, but MACCS cannot represent chemical substructures that do not appear in its library. We also tried ECFP, but ECFP could not accurately represent the structural information of the substrate molecules.

In the end, we chose a GNN to model substrates.

Add species-related information


Because the number of distinct words in the species names is large, one-hot encoding would occupy a lot of memory, so we use word embeddings and introduce an Embedding layer. We extract the words that make up the species names in the dataset, add the start identifier "<start>" before and the end identifier "<end>" after each species name, join them into one continuous text, and train the embedding by using the word before and the word after each word to predict that word.

After training on the dataset, each constituent word of a species name, as well as the start identifier "<start>" and the end identifier "<end>", is represented by a vector.

The vectors corresponding to all constituent words of each species, the vector corresponding to the start identifier "<start>", and the vector corresponding to the end identifier "<end>" are summed to obtain the vector representing each species.
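The bookkeeping around the embedding can be sketched as below: building the token stream, forming the (previous word, next word) → word training pairs, and summing word vectors into a species vector. The helper names and the two-dimensional example vectors are assumptions for illustration; in the real pipeline the vectors come from the trained Embedding layer.

```python
def build_token_stream(species_names):
    """Wrap every species name in <start>/<end> and join into one token list."""
    stream = []
    for name in species_names:
        stream += ["<start>"] + name.split() + ["<end>"]
    return stream

def context_pairs(stream):
    """((previous word, next word), target word) for every interior token."""
    return [((stream[i - 1], stream[i + 1]), stream[i])
            for i in range(1, len(stream) - 1)]

def species_vector(name, word_vectors):
    """Sum the vectors of <start>, each word of the name, and <end>."""
    tokens = ["<start>"] + name.split() + ["<end>"]
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for token in tokens:
        total = [t + x for t, x in zip(total, word_vectors[token])]
    return total

stream = build_token_stream(["Escherichia coli", "Actinoplanes sp."])
pairs = context_pairs(stream)

# Hypothetical 2-dimensional vectors standing in for a trained Embedding layer.
vectors = {"<start>": [0.1, 0.0], "<end>": [0.0, 0.1],
           "Escherichia": [1.0, 2.0], "coli": [0.5, 0.5]}
vec = species_vector("Escherichia coli", vectors)
print(pairs[0], vec)
```

Summing keeps the species vector the same size regardless of how many words the name contains, which makes it easy to concatenate with the substrate and protein representations.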

When training on a dataset of 401 randomly selected entries, we observed smaller values of the loss function than with the original method. This indicates that we not only added species information but also improved the original model.


Network Hierarchy


After discussing with our PI, Zhang Yue, we decided to try a GAN to improve the accuracy of kcat prediction. Building on the existing GNN and CNN, we added a GAN component after merging the two representations. Each time a new merged vector is obtained, a false vector is generated from a Gaussian distribution with the same mean and standard deviation. Both the true and the false vectors are used to calculate the loss, and we iterate many times to approximate the real situation. However, this did not improve accuracy much. On the contrary, it increased the cost of training the model, so we abandoned this module.
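The false-vector step can be illustrated with the standard library alone. This is only a sketch of the sampling idea described above, not of the full adversarial training; the function name and example vector are hypothetical.

```python
import random
import statistics

def fake_vector_like(real_vector, rng=random):
    """Sample a 'false' vector from a Gaussian with the real vector's mean and std."""
    mu = statistics.mean(real_vector)
    sigma = statistics.pstdev(real_vector)
    return [rng.gauss(mu, sigma) for _ in real_vector]

rng = random.Random(0)            # fixed seed so the sketch is reproducible
real = [0.2, 0.4, 0.6, 0.8]       # stands in for a merged GNN + CNN vector
fake = fake_vector_like(real, rng)
print(fake)
```

The false vector matches the real one only in its overall mean and spread, which may be one reason the extra component added training cost without adding much useful signal.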

The improvement of user interactivity

In response to user feedback, we added new features that support mutating amino acids at multiple sites of an enzyme and return the predicted kcat values. In addition, the software can update the GEM in real time to show the effect of the mutations on flux. This greatly improves the usefulness of the software and can shorten the experimental cycle.
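A minimal sketch of how several point mutations might be applied to one sequence before prediction is shown below. The mutation notation (wild-type residue, 1-based position, new residue, for example "K2W"), the function name and the example sequence are assumptions for illustration. The resulting mutant sequence would then be passed to the kcat predictor.

```python
import re

MUTATION_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")

def apply_mutations(sequence, mutations):
    """Apply several point mutations such as ["K2W", "A4G"] (1-based positions)."""
    residues = list(sequence)
    seen_positions = set()
    for mutation in mutations:
        match = MUTATION_PATTERN.match(mutation)
        if match is None:
            raise ValueError(f"Cannot parse mutation: {mutation!r}")
        old, position, new = match.group(1), int(match.group(2)), match.group(3)
        if not 1 <= position <= len(residues):
            raise ValueError(f"Position out of range: {mutation}")
        if position in seen_positions:
            raise ValueError(f"Position mutated twice: {mutation}")
        if sequence[position - 1] != old:
            raise ValueError(f"Wild-type residue mismatch: {mutation}")
        residues[position - 1] = new
        seen_positions.add(position)
    return "".join(residues)

mutant = apply_mutations("MKTAYG", ["K2W", "A4G"])
print(mutant)
```

Checking the wild-type residue at each position catches off-by-one errors in user input before an incorrect sequence reaches the model.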

Further Improvement

After summarizing the results of the dry-lab and wet-lab work and the user feedback, we carried out further research and made plans for further improvements to our software.

  • At present, our software uses only the built-in GEM of Actinoplanes sp. SE50/110 as its model organism. In the future, it could automatically build a GEM for an organism specified by the user and then apply the same series of prediction and validation functions to it.
  • The software could also be linked to large metabolomics and enzyme databases, using their updated metabolic reaction information and enzyme parameters to keep the GEM up to date.
  • To be continued....