In this page, we introduce our DL-ecGEM model from two parts, ecGEM and DLkcat. DLKcat model is composed of two parts, GNN and CNN.
The basic idea for our model is simple: the chemical reaction network in an organism has a series of constrains, so we can find a way to optimize the production of certain substances.
Traditional GEM model the steady state assumption, which means that the reaction rate of each substance equals to zero.
The chemical reaction network can be expressed by a stoichiometric matrix. As is shown in the figure below, rows of the matrix stands for each substance, and columns stands for each reaction. With the stoichiometric number of each substance in each reaction presented in the matrix, this matrix can demonstrate the reaction network completely and precisely.
Some issues are worth mentioning. This method is only applicable for irreversible reaction, so reversible reactions are separated into two irreversible reactions. Besides, circumstances like isozyme are regarded as pseudo metabolites. The behavior of pseudo metabolites are described by an 'enzyme pool'.
Given the matrix, we can span the reaction rates for each reaction in a linear space. The mathematical expression of each constrain is listed as follows:
Then a solution space can be retrieved. With this solutions space, the maximization or minimization of the target reaction can be solved as a LP (linear programming) problem
We use the GECKO method (GEM with Enzymatic Constraints using Kinetic and Omics data) to reconstruct the model.
In this method, another constrain is introduced: every reaction has a basic constrain that the reaction rate cannot exceed the maximum reaction rate vmax ,which is equals to the concentration of the corresponding enzyme times its Kcat value, namely:
We call it as the enzyme usage constrains.
To include the enzyme usage constrains mentioned above, enzymes are introduced into the stoichiometric matrix, so that the matrix is augmented. Thus, the basic process of our construction methodology can be expressed in the figure below:
In this work, we aim to construct the DL-ecGEM model for actinomycetes. The reaction list for actinomycetes is constructed automatically from Model-SEED database. By comparing Bacillus macrophyllum WSH002 to a closely related plant of the genus Coelicolor Streptomyces, another reaction list is retrieved by BLASTp method. The primitive model is the combination of two reaction lists mentioned above.
As supplementary, public information systems including BRENDA, KEGG, BiopathExplore, CELLO, TCDB and metacyc are used to fill metabolic gaps, balance mass and charge, determine the direction of reactions, add information about genes and reaction localization, add subsystem information to the reaction, and add some transport reactions. Then the GEM model for actinomycetes is ready for further processing by the GECKO method.
The original GEM model includes 1127 metabolites and 1336 reactions. After GECKO process, the model has 2789 metabolites, 6030 reactions, 984 enzymes, 677 pseudo metabolites and a variable for total enzyme usage, which is termed as the iYLW1029 version. The figure below illustrates the structure of the final model.
In this work, we use Flux Balance Analysis (FBA) by COBRA toolbox to simulate the growth of actinomycetes. In this case,Production of Acarbose is the target function to maximize. The condition on the petri dish should be defined manually.
The function returns a structure describing the best solution containing the target value f, solution x which stands for flux of each reaction, cost reduction w which stands for the influence value of each reaction on the target, solver state flag origstat which verifies the existence of the solution, and processing time.
The model iYLW1028 basically involves 31 reactions with regard to Acarbose production. As can be seen on the left, Acarbose 7-phosphate is regarded as the final product of cytoplasmic pathway. With the help of ABC transport protein, Acarbose 7-phosphate is transported out of the cell and lead to the formation of Acarbose.
We set the acarbose exchange reaction as an objective function, the maltose exchange reaction as a carbon source uptake reaction and the ammonia exchange reaction as a nitrogen source uptake reaction. We also simulated their growth given growth rate limits and upper limits for carbon and nitrogen source uptake, and obtained the metabolic fluxes for each of their reactions by linear solving.
To look for the restriction enzyme for Acarbose production and find the target for enzyme engineering, we use the GECKO2.0 toolbox for flux balance analysis. Flux Control Coefficients (FCC) is introduced for the evaluation of each enzyme's influence on the target reaction. It is defined as the ratio of target flux's relative variation to 0.1% change of kcat of an enzyme, namely:
There are two functions in the toolbox that can find important enzymes. The function 'findTopLimitations' return the enzyme with largest FCC, which means they are the most influential enzymes. The function 'ecFSEOF' looks for key pathways and key enzymes as well as the key genes that control key pathways.
Through modification and combination of these functions, We can retrieve 20 enzymes which provides the biggest restrictions in the process of Acarbose production.
The key enzymes we found are listed as follows:
To compensate for missing Kcat values in the Actinomyces database and to predict the effect of protein mutations on enzyme activity, we introduced a deep learning algorithm to predict the unique Kcat value corresponding to the substrate and protein, combined in ecGEM.
Structure of GNN model:
The substrates are represented as molecular diagrams where the vertices are atoms and the sides are chemical bonds.
For the substrate, there are only a few types of chemical atoms (e.g., carbon and hydrogen) and chemical bonds (e.g., single and double bonds). In order to obtain more learning parameters, for each atom, we take out all the atoms within its radius r to form a subgraph, and use the subgraph to represent the environment of each atom.
First, the substrate SMILES information is processed into rdkit objects using the RDKit library, the number of each atom and chemical bond is obtained and processed into a vector, and a nonlinear activation function is added at the end of the GNN to continuously update this vector. It is found experimentally that the best results can be achieved when r = 2 and the number of layers is 3.
Therefore, a 3-layer linear fully connected network is set up in the GNN. Inputting the smiles information of the substrate, GNN can output a real-valued molecular vector representation of a set of substrates.
The features of the atoms (chemical bonds) around each atom (chemical bond) are summed at each iteration, thus ensuring that the information in the subgraph is updated.
★ Vertex transformation:
The features of the nodes in the next layer are derived by summing the corresponding nodes and adjacent nodes in the previous layer using the sigmoid function.
★ Edge transformation:
The characteristics of the next layer of edges are derived by summing the corresponding nodes and adjacent nodes of the previous layer using the sigmoid function.
Structure of CNN model:
The protein sequences were divided into different words according to the ngram, and the words were numbered to convert the protein sequences into vectors. To prevent some words from having too low frequency, smaller ngram(3) are used. The vectors are updated using a 3-layer linear network and RELU activation function.
The code of our deep learning network is shown below.
We use MSE loss function as our loss function.
We use Adam as our optimizer.
[1] Wang Y, Xu N, Ye C, Liu L, Shi Z and Wu J (2015) Reconstruction and in silico analysis of an Actinoplanes sp. SE50/110 genome-scale metabolic model for acarbose production. Front. Microbiol. 6:632. doi: 10.3389/fmicb.2015.00632
[2] Sánchez, Benjamín & Zhang, Cheng & Nilsson, Avlant & Lahtvee, Petri-Jaan & Kerkhoven, Eduard & Nielsen, Jens. (2017). Improving the phenotype predictions of a yeast genome‐scale metabolic model by incorporating enzymatic constraints. Molecular Systems Biology. 13. 10.15252/msb.20167411.
[3] Gu D, Jian X, Zhang C, Hua Q (2016) Reframed genome-scale metabolic model to facilitate genetic design and integration with expression data. IEEE/ACM Trans Comput Biol Bioinform https://doi.org/10.1109/TCBB.2016.2576456
[4] Domenzain, I., Sánchez, B., Anton, M. et al. Reconstruction of a catalogue of genome-scale metabolic models with enzymatic constraints using GECKO 2.0. Nat Commun 13, 3766 (2022). https://doi.org/10.1038/s41467-022-31421-1
[5] Schellenberger, J., Que, R., Fleming, R. et al. Quantitative prediction of cellular metabolism with constraint-based models: the COBRA Toolbox v2.0. Nat Protoc 6, 1290–1307 (2011). https://doi.org/10.1038/nprot.2011.308
[6] Lu, H., Li, F., Sánchez, B.J. et al. A consensus S. cerevisiae metabolic model Yeast8 and its ecosystem for comprehensively probing cellular metabolism. Nat Commun 10, 3586 (2019). https://doi.org/10.1038/s41467-019-11581-3
[7] Li, F., Yuan, L., Lu, H. et al. Deep learning-based kcat prediction enables improved enzyme-constrained model reconstruction. Nat Catal 5, 662–672 (2022). https://doi.org/10.1038/s41929-022-00798-z
[8] Bendl J, Stourac J, Sebestova E, Vavra O, Musil M, Brezovsky J, Damborsky J. HotSpot Wizard 2.0: automated design of site-specific mutations and smart libraries in protein engineering. Nucleic Acids Res. 2016 Jul 8;44(W1):W479-87. doi: 10.1093/nar/gkw416. Epub 2016 May 12. PMID: 27174934; PMCID: PMC4987947.