modelling.svg

Modeling

  1. Abstract
  2. Homology Based Modeling
  3. Molecular Dynamics
  4. ProteinGAN
  5. References

Modeling of DCDDH (tphB)

As we were developing our project plan, we realized there were some pieces in our puzzle that were better known than others. Therefore, we decided to focus on the enzyme TPHB (also known as DCDDH), involved in the degradation of terephthalate (TPA) via the protocatechuate (PCA) 4,5-cleavage pathway. This enzyme is a critical step in our pathway as it catalyzes the dehydrogenation of 1,2-dihydroxy-3,5-cyclohexadiene-1,4-dicarboxylate (DCD) to yield protocatechuate (PCA). This enzyme has never been experimentally determined, although we have publicly available homology and ab initio models.

Our goal was to get some more insights over this enzyme and to produce new variants of tphB that could improve its catalytic activity and consequently the whole pathway. First, we decided to perform some descriptive analysis.

We studied the protein binding site with CavityPlus, and the protein-ligand interaction with DockThor. We also built our own homology-based model with MODELLER and performed a molecular dynamics simulation with GROMACS. Finally, we decided to use an AI approach, a Generative Adversarial Neural Network (GANN) called ProteinGAN which is able to learn the evolutionary relationships from natural protein sequence diversity and creates new variants with natural-like properties. Finally, we predicted de novo our variants folding with Alphafold 2.0 and studied how they compare with the natural tphB.

Apart from ab initio methods, now widely known due to the massive spread in its use since the release of Alphafold 2.0 (Jumper et al. 2021), more conservative in silico homology modeling had been used up to date with the highest accuracy when proper templates were available. Below analysis shows how we modelled by homology based on experimental template X-ray Crystal structure of a Terephthalate 1,2-cis-dihydrodioldehydrogenase from Burkholderia xenovorans LB400 (4aty.1) with tphB sequence from Comamonas testosteroni (accession code Q5D0X4 Uniprot).

MODELLER (Webb & Sali 2016) is a computer program used for homology-based modeling of protein tertiary, and quaternary, structures, based on how well these models satisfy spatial restraints. The best models were selected based on the DOPE and GA341 score.

The Discrete optimized protein energy, DOPE, score defines how much energy a protein model has based on the interactions between its parts, evaluating how good a homology-based modeling prediction of a protein structure prediction is based on spatial restraints. As such, the lower the normalized DOPE score, the better the protein structure should be.

GA341 score distinguishes good from bad models, scoring them from worst, 0.0, to best, 1.0, based on the statistical potential of a model protein’s fold. It results from the combination of a model’s compactness, the percentage of sequence identity of the alignment between model and template, and on the combined statistical potential z-score of the model (Eramian et al. 2008).

Filename molpdf DOPE score GA341 score Normalized DOPE score
tphB.B99990001.pdb 2.873,95654 -33.737,78906 1,00000 -0,496088827
tphB.B99990002.pdb 3 149,73071 -33 248,76953 1,00000 -0,386134538
tphB.B99990003.pdb 2 970,07446 -33 016,90625 1,00000 -0,333994119
tphB.B99990004.pdb 2 946,83569 -33 285,92188 1,00000 -0,394218766
tphB.B99990005.pdb 2 875,59399 -33 400,66406 1,00000 -0,420081271

DOPE and GA341 scores and Normalized DOPE score for tphB homology-based models generated with MODELLER with B. xenovorans LB400 tphB X-ray Crystalography template (4aty.1).

Docking and Ligand interactions

Cavity plus is a program that facilitates the detection of protein cavities and, through that, of potential binding sites. To detect the binding site on DCDDH, its sequence was uploaded onto Alphafold. Using the resulting modelled structure, the most likely active site was calculated on cavity. This allowed the model to be run on Dockthor. Dockthor predicts docking within a molecule. As such, it proposes the energy associated with the binding of a substrate with the enzyme according to different changes occurring in the environment, as defined by the protonation state of different significant groups, and the protein sequence. Using the information gathered about the active sites location, it was possible to calculate the most likely interaction between ligand and DCDDH.

Molecular dynamics (MD) is a computer simulation method to represent how molecules pysical movement behave in a complex system, as they interact for a fixed period of time. They are commonly described by Newton's equations of motion for an interacting particle system and uses potential energies for calculating interatomic potentials or molecular mechanics force fields.

GROMACS, which originated from Groningen Machine for Chemical Simulation, is a software package developed by the University of Groningen in the early 1990s. GROMACS now is a community-driven, versatile package to perform different kinds of molecular dynamics, available under the GNU Lesser General Public License (LGPL). (van der Spoel et al. 2005). It can simulate any molecule in solution or crystal with molecular dynamics, stochastic dynamics or path integration methods, minimize molecular energy, analyze conformation, etc. It was primarily designed for complicated bonded interactions but is now also used for the simulation of non-biological systems. Its simulation package includes GROMACS force fields (proteins, nucleotides, sugars, etc.), and studies can range from glass and liquid crystals to polymers, crystals, and biomolecular solutions. GROMACS is a powerful molecular dynamics simulation software, which has great advantages in simulating the Newtonian motion of a large number of molecular systems.

In our project, tphB is an enzyme whose protein structure has not been experimentally reported yet, although Alphafold prediction and homology-based models are available.. In our modeling strategy, we used the best DOPE scoring model from MODELLER to perform our simulation. For insights about implementation protocol check GROMACS Tutorials for Lysozyme in Water.

Figure 1. Potential energy during energy minimization. The potential energy should be negative and on the order of 105-106. As the figure shows, energy minimization was successful with potential energy decrease to and stable at a certain energy level. XMGrace for plotting. Results from GROMACS.

After EM, the model will have a reasonable starting structure. But the EM only happens between solvent itself, to ensure a proper geometry and solvent orientation also contain protein, an equilibration must be conducted before actually unrestrain the system.

gromacs-temp.png
gromacs-pressure.png
gromacs-density.png

Figure 2. Plots of equilibration for temperature, pressure and density. Temperature, pressure and density are equilibrated in sequence. All three equilibrations are successful as they remain stable during the process. XMGrace for plotting. Results from GROMACS.

When all pre-work is done, the model will be ready for MD simulation. After simulation, root-mean-square deviation (RMSD) can be calculated. When an equilibrated structure follows crystal RMSD pattern we can assume we have a stable system.

rmsd.png

Figure 3. Plot of RMSD of crystal (red line) and equilibrated (black line) protein. RMSD off to ~0.3 nm (3 Å) for both crystal and equilibrated stands for a stable protein. XMGrace for GROMACS.

Another analysis we can perform is the compactness of the protein by measuring the stability of the radius of gyration (Rg). If a protein is stably folded, the Rg will remain relatively steady over the time. The atoms are explicitly mass-weighted and the radius measured over the course of 1ns at 300 K degrees.

gyration.png

Figure 4. Lines represent the mass-weighted root-mean-square of the radius for total (black) but also about the x- y- z- axes, as a function of time. Each with its respective value is maintained mostly steady over 1 ns. XMGrace for plotting. Results from GROMACS.

tphbmdsim.gif

Figure 5. MD Simulation of tphB top DOPE scoring model from MODELLER in a water solution over 1ns at 300 K. Simulation from GROMACS.

Proteins play a vital role in several different metabolic processes and most importantly, are the building blocks of all forms of life. The main way of understanding how they function is to understand their structure. As proteins are made of amino acids placed in specific sequences, there can be different ways to predict the way a protein can exist in nature. Natural proteins that have gotten their form and function out of thousands of years of evolution and several different structural conformations into being stable and functional.

As much as observing proteins in nature is popular in order to understand what nature does, we have also done several attempts to alter the way it exists to see what more it can do. One such practice is protein engineering. Protein engineering has the potential to improve the ability of the enzymes. Previously, this was done in laboratories by using methods such as point mutations (changing one very crucial amino acid in a functional site of the protein) and recombination (mixing segments) of naturally occurring homologous proteins (very similar in function and structure). These processes cost a lot of time and resources, which was not beneficial for high throughput analyses.

There had been many advents of high performance computing indulging in handling biological data. One such weapon is Machine Learning. This has the potential to exponentially reduce time to explore the space of protein folding.

One of the recent ML methods, developed by research groups in Sweden and Lithuania, is the ProteinGAN (Repecka et al. 2021). This system adapts a Generative Adversarial neural Network. Generative modeling is an unsupervised (does not have previous knowledge defined) learning method in machine learning that involves automatically discovering and learning the patterns in the data we give the network in such a way that the model will be able to generate new examples that plausibly could have been drawn from the original dataset.

The way ProteinGAN works is that the learning mechanism takes several conditions of why a protein exists a certain way (catalytic activity, structural chemistry) and also how it got to that point(evolution of the protein).

Hence this system works by introducing guided mutations, by learning the underlying amino-acid relationships, when given a set of protein sequences that belong to the same family, learning which of these mutations works the best in terms of stability and catalytic activity. The system uses two networks, a discriminator and a generator. These components work by the generator generating protein sequences that are discriminated into real, authentic data(the actual protein sequences) and the ones generated by the generator.

The discriminator takes in encoded data(pertaining to the amino acids and blank space between them, when used as FASTA sequence).

GAN.png

Figure 6. ProteinGAN learns the intrinsic relationships between natural protein sequences. ProteinGAN training scheme. Given a random input vector, the Generator network produces a protein sequence, which is scored by the Discriminator network by comparing it to the natural protein sequences. The generator tries to fool the discriminator by generating sequences that will eventually look like real ones (the generator never actually sees real enzyme sequences). (Repecka, Donatas, et al., 2021).

Subsequently, the generator uses the feedback provided by the discriminator (i.e., the characteristics that allowed it to tell generated data apart from real data) to generate new data. The generator never processes or analyzes real data and the data it produces. Therefore, its learning relies solely on the outcome of the analyses carried out by the discriminator (Repecka et al. 2021). Upon iteration of such actions, the system gets better at generating sequences that the discriminator cannot tell apart if the sequence was generated. This system produces functional proteins that could potentially be biologically active but don't exist in nature or yet to be discovered.

We decided to use this system for engineering our protein TPADO, that comes in 3 subunits(2 subunits each - TPDA1_COMSP(Q3C1E3), TPDA2_COMSP(Q3C1D5), TPDR1_COMSP(Q3C1E0), TPDR2_COMSP(Q3C1D2), TPDB1_COMSP(Q3C1E2), TPDB2_COMSP(Q3C1D4)). This protein is a part of the PET degradation process wherein it involves transporting TPA(terephthalate) into the cell and the degradation of TPA into PCA. Our aim was to generate the most potent variant for each of these subunits. We followed the instructions available at the ProteinGAN github repository, for setting up the environment, altering variables and dataset for the neural network[4].

To run our neural network, we had to generate data - for which we wanted to collect the protein sequences for the family of proteins each of our query belonged to - for which we collected the protein sequences of the subunits and proceeded to add elucidate the HMM profile for the protein(the hmmbuild function was used for this process). This was done on the Uppmax supercomputing cluster in Uppsala University. The tool used was HMMer. Once the profile was built, we searched against the protein database locally available to derive the closely related sequences with the HMMer tool, using the hmmsearch function. Upon deriving the set of sequences we built a database using one of the BLAST functions, makeblastdb.

We then split the available data into “train” and “test” data, wherein the train set was used to make the neural network learn and the test data to validate. The proportion was 80% of the data used to train and 20% to test. The data that was split was further vectorized by assigning random numbers to the amino acids.

To run the network, a Conda environment was set up on Uppmax, according to the advice of the reference material. With the databases prepared and formatted, we then proceeded to manipulate the script according to the requirements of the network. This was done by altering the dimensions of the vector matrix according to the size of the sequences available in the database created, including the enzyme class and so on.

Challenges

We managed to start training the network but we faced several problems that made us not fulfill our goal. The network was not finding the same shape as the model was requiring, but we fixed it by giving it the maximum sequence length (5000+ aa). We did not manage the network to identify the enzymatic class. We believe it might have been solved by filtering our dataset, removing long sequences or readjusting threshold parameters during BLAST. We realized it might have been too ambitious for our team with no previous skill or guidance on GANN.

Eramian D, Eswar N, Shen M-Y, Sali A. 2008. How well can the accuracy of comparative protein structure models be predicted? Protein Science : A Publication of the Protein Society 17: 1881–1893.

Highly accurate protein structure prediction with AlphaFold - PubMed. WWW document: https://pubmed.ncbi.nlm.nih.gov/34265844/. Accessed 6 October 2022.

Repecka D, Jauniskis V, Karpus L, Rembeza E, Rokaitis I, Zrimec J, Poviloniene S, Laurynenas A, Viknander S, Abuajwa W, Savolainen O, Meskys R, Engqvist MKM, Zelezniak A. 2021. Expanding functional protein sequence spaces using generative adversarial networks. Nature Machine Intelligence 3: 324–333.

Van Der Spoel D, Lindahl E, Hess B, Groenhof G, Mark AE, Berendsen HJC. 2005. GROMACS: fast, flexible, and free. Journal of Computational Chemistry 26: 1701–1718.

Webb B, Sali A. 2016. Comparative Protein Structure Modeling Using MODELLER. Current Protocols in Bioinformatics 54: 5.6.1-5.6.37.

Relevant Web Resources

  1. Biomatter-Designs: ProteinGAN
  2. MD Tutorials
  3. Zenodo ProteinGAN Additional Files