DTU BioBuilders have developed a pipeline to design proteins to fit to a specific ligand of interest. The pipeline only needs a protein amino acid sequence as an input but more information can be provided to increase the ligand binding success rate. Data-driven protein design has helped us reduce the workload in the wet lab and gain deeper insights into the protein engineering process. HbaR and Hmox1 were chosen as our candidate proteins to be engineered to bind to our ligand of interest, furfural. These proteins were used to test the pipeline and gave us relevant information about protein stability, protein-ligand interaction, and possible mutations to increase the binding affinity. Ultimately, they serve as a sensing domain in the final synthetic transcription factor.

Overview

The pipeline comprises four main stages: 3D structure prediction, binding sites prediction, protein mutations and ligand dockings, and NAMD molecular dynamic simulation. The entire pipeline is visible in figure 1.

For 3D structure prediction, a state-of-the-art AlphaFold2 model was used. If the protein structure has already been experimentally determined this step can be skipped. In the next step, a machine learning based method, P2Rank, using random forest algorithm was used to predict all possible binding pockets of a protein. This would also include DNA binding pockets which is unwanted for the purpose of this project, so we created a filter to only include the most probable ligand binding pocket and extract its coordinates from the result. Once again, this step can be skipped if the coordinates of the ligand binding site is known from experimental data.

In the next step, protein mutations and ligand dockings are performed with RosettaLigand in which the coordinate information will be used as the initial search point. A detailed description of the information needed to succeed in this step can be found in the protocol section (link). As a substep, the outputs from RosettaLigand should be filtered and analyzed before moving to the final step which is NAMD molecular dynamic (MD) simulation. Like all other MD simulation methods, NAMD is very time consuming and computationally intensive, therefore, only the most probable mutation and docking model with a high chance of success should be chosen as an input. There are three ways to filter the quality of RosettaLigand’s outputs: violin plot, PyMol visualization, and superimposition. With violin plot, relevant energy terms can be compared between wild-type and mutant model. Qualitatively, PyMol visualization (combined with Ramachandran plot) and superimposition can be used to detect unusual interactions and side chain rotations and structure breaking mutations. Finally, you can run the MD simulation and observe how the protein behaves and interacts with the ligand under a solvent condition. NAMD plots provide you with information about different energy terms that change over time.

AlphaFold2

AlphaFold2 allows work in the SynBio sphere to far exceed what was previously thought to be state-of-the-art with regards to protein design and development, particularly with regards to the age-old computational problem of predicting protein structures without the need for performing resource intensive experimentation.To simplify, AlphaFold2 treats amino sequences like language sentences. Instead of grammar that dictates the sentence structure, highly conserved domains taken from protein databases on the internet will dictate how proteins will fold. The full paper of AlphaFold2 can be accessed here.

P2Rank

P2Rank predicts all possible binding sites on a protein using a machine learning algorithm called random forest. The output gives a nice visualization on PyMol to gain deeper insights on possible ligand interacting residues at the binding pockets. The web-based version of the tool can be accessed here and you can read more about its paper here.

Rosetta

Rosetta is an academic software suite that contains many tools for different purposes. In this project, we are interested in the RosettaLigand tool for ligand docking and protein design. All the steps to design our proteins (HbaR and Hmox1) follow exactly a protocol developed by (Moretti et al. (2016)). The illustration of the pipeline can be seen on its paper. Rosetta simply uses Monte Carlo minimization with the Metropolis criterion (MCMC) method to choose the best docking and design. The quality of the result is greatly influenced by the analysis parameters and constraints. To increase the chance of finding the optimal result, it is important to have a high number of docking predictions.

The input for Rosetta is the predicted structure from AlphaFold2 and the predicted binding sites from P2Rank. The result would be more reliable if the crystal structure of the protein and experimentally determined ligand binding site are used.

To interpret the result, we followed a Rosetta scores guide that was hosted on iGEM. The guide helped us to clearly understand the meaning of each parameter. You can find the document here.

NAMD

NAMD stands for Nanoscale Molecular Dynamics, was released in 2005 and remains one of the best tools available for calculating molecular dynamics simulations. However, when it comes to anisotropic systems such as lipid bilayers, NAMD performs worse compared to other tools such as GROMACS. For the purpose of this project, this problem can be ignored. The ligand-bound protein structure that was generated from the previous steps will be the input for NAMD. More steps would be needed to get all the necessary parameters and files for the NAMD to run. Fortunately, we could use a tool called “solution builder” that is hosted on CHARMM-GUI by Lehigh university to do this for us. Moreover, the tool generates a Unix script that contains all the commands to run the simulation for 10 ns. Users can edit the script to run for longer. Ideally, the simulation should run as long as possible. In our project, the simulation was run for only 20 ns due to computational resource limitations.

NAMD simulation takes into account every atom's potential energy and calculates their position in the following time-step. The equation to calculate the total potential energy of an atom can be seen below. The full details on how to calculate each potential energy can be read from the NAMD paper . The solution condition is assumed to be at constant temperature and pressure (known as NPT simulation). \[U_{total} = U_{bond} + U_{angle} +U_{dihedral} +U_{vdW} +U_{coulomb}\]

References

Jumper, J., et al. Highly accurate protein structure prediction with AlphaFold. Nature 596 (2021) p. 583-589
Moretti, Rocco et al. Rosetta and the Design of Ligand Binding Sites. Methods in molecular biology (Clifton, N.J.) 1414 (2016) p. 47-62
Phillips J. C. et al. Scalable Molecular Dynamics with NAMD Journal of computational chemistry 26,16 (2005) p. 1781-1802
Preetha Santhakumar et al. Molecular docking analysis of furfural and isoginkgetin with heme oxygenase I and PPARγ Bioinformation. 17,2 (2021) p. 356–362
rivák, R., Hoksza, D. P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure. J Cheminform 10, 39 (2018). https://doi.org/10.1186/s13321-018-0285-8

HbaR

Motivation

In (the paper about sBAD), the authors have developed a synthetic transcription factor to respond to benzoic acid and similar ligands by using HbaR as the sensing domain. Figure 2 shows the chemical structure of pHBA (benzoic acid) and our ligand of interest, furfural. They share a similar structure which motivated us to choose HbaR as a candidate protein to be subjected to rational modifications.

3D structure prediction - Alphafold2

Figure 3 shows the theoretical 3D structure of the HbaR sensing domain with the DNA binding domain (LexA) and activation domain (B112). The nuclear localization sequence (NLS) SV40 was engineered into the protein to transport the protein into the nucleus. Qualitatively, the model structure showed the protein folded well with distinct domains as annotated by us based on the protein sequence. In figure 4, we can see more details about the length and location of each domain. The regions that were associated with high predicted local distance difference test (plDDT) indicated a high model confidence while the regions with low plDDT could mean AlphaFold2 could not predict the regions well due to them naturally existing as a loop. These regions can be links between domains or signal peptides depending on their location. With this information, we can see from figue 4 that the protein is divided into three main domains. Interestingly, the first domain LexA was divided into two subdomains. The structure could be further investigated by interacting with figure 5.

Figure 5. Chemical structure of benzoic acid (left) and furfural (right)

Binding sites prediction

HbaR binding sites prediction could be done in two ways: machine learning based method (P2Rank) or homologous crystal structure (protein kinase A receptor, 5K8S).

P2Rank

P2Rank showed us there were eight binding sites in the protein but there were only two that were a part of the HbaR sensing domain as seen in figure 6. In table 1, pocket 5 had a higher probability to be a ligand binding site so its coordinates were used for ligand docking with Rosetta.

Name	Rank	Score	Probablity
pocket1	1	17.02	0.793
pocket2	2	7.37	0.386
pocket3	3	3.17	0.108
pocket4	4	2.5	0.069
pocket5	5	2.32	0.06
pocket6	6	1.45	0.019
pocket7	7	1.18	0.01
pocket8	8	0.94	0.005

Table 1. P2Rank result with corresponding statistic for each binding pocket

Homologous crystal structure (PDB ID: 5k8s)

Based on homology modeling, protein kinase A receptor (pkAr) was shown to have a similar secondary structure as HbaR. It was possible that AlphaFold2 also used an experimentally determined structure of pkAr to model HbaR. We investigated the similarity of the two structures by aligning the HbaR sensing domain of the AlphaFold2 model to a crystal structure of pkAr. The RMSD result was 1.334. A RMSD value of 0 means the two structures are perfectly aligned with each other, so the result shows us that the pkAr crystal structure is very similar to the Alphafold2 model structure of the Hbar sensing domain. From this result, we morphed the Alphafold2 model of HbaR to the cAMP-bound pkAr structure (figure 7) and replaced the cAMP with furfural. Interestingly, the binding site of pkAr had the same location as the predicted pocket 5 we found using the P2Rank method. This result was used as the input for ligand docking with Rosetta.

Protein mutations and ligand dockings

The list of selected mutations can be found here. The 8 lowest binding energy designs for each binding site prediction method were chosen. The first eight designs belong to the homologous crystal structure method. The last eight design belong to the P2Rank method.

PyMol visualization

P2Rank

The docking prediction with the lowest binding energy out of 1000 possible predictions was chosen to be analyzed. Figure 8 shows the residues in the binding pocket that interact with furfural within 4 Ångström as sticks. It can be observed that residue H320 forms two hydrogen bonds with furfural at both oxygen molecules. This interaction shows the residue could be critally for the binding of furfural to HbaR. The interaction can be further investigated by interacting with figure 9.

Figure 9. Residues that interact with furfural. Each stick represents a residue that is predicted to interact with furfural (Interactive)

Homologous crystal structure (PDB ID: 5k8s)

Due to the morphing of the Alphafold2 model of HbaR to fit the pkAr structure, only the parts of the Alphafold2 model that were matched with pkAr were retained and designed to dock furfural. Consequently, this affected the residues that interact with furfural as seen in figure 10, as only aliphatic residues are observed. The interaction can be further investigated by interacting with figure 11.

Figure 11. Residues (represented as sticks) that interact with furfural (Interactive)

Superimposition

Superimposition was done to predict and observe how the mutations affect the protein folding or function. It is important to know that all the models were generated by AlphaFold2 in this step. From figure 12, the results clearly showed that the P2Rank method to design the protein had less effect on the protein structure than using the pkAr crystal structure method. Low RMSD value means the protein structure was not greatly altered by the mutations.

Violin plots

Rosetta managed to design the HbaR binding site to have lower binding energy to furfural for both methods (figure 13). There were more predictions based on the P2rank method that had lower binding energy than the crystal structure based method. However, some predictions with the crystal structure based method had lower binding energy than what the P2Rank based method could achieve.

It was not enough just to look at the binding energy to determine the chance of successful ligand binding. The total energy was also important as it described the protein stability. Interestingly, figure 14 showed the crystal structure method to predict the HbaR binding site made the whole protein very unstable for both the mutated and non-mutated form. This could be due to the morphing of the Alphafold2 model to the crystal structure introducing more unfavorable rotamers to the protein residues.

NAMD simulation

In the final step, we observed the effect of the mutations through NPT simulation. Furfural was initially placed at the binding site of the HbaR sensing domain and the simulation was run for 20 nanoseconds.

Figure 15 shows how the non-mutated form of the protein interacts with furfural. Unsurprisingly, the ligand did not stay in the binding pocket and undocked quickly.

P2Rank

With the P2Rank method, we were fortunate to get a design that behaves as we expected in the simulation (figure 16). The ligand initially interacts with pocket 5 as predicted with the P2Rank tool and then jumps to pocket 6 after a while and stays there. You can observe how the alpha helix secondary structure changes to fit to the ligand after 150 ps.

Homologous crystal structure (PDB ID: 5k8s)

With the homologous crystal structure method, the ligand did not stay in the binding pocket (figure 17). However, it stayed in the pocket for longer than for the non-mutated form of the protein.

NAMD plots

All the relevant energy terms of the simulation at each timestep could be extracted and plotted as NAMD plots (figure 18). The simulation for the crystal structure method had the highest total energy. This meant, based on the simulation, the mutations made using the crystal structure method made the protein more unstable than the non-mutated form. The total energy between the non-mutated and mutated form of HbaR using the P2Rank method was approximately the same.