DTU BioBuilders have developed a pipeline to design proteins to fit to a specific ligand of interest. The pipeline only needs a protein amino acid sequence as an input but more information can be provided to increase the ligand binding success rate. Data-driven protein design has helped us reduce the workload in the wet lab and gain deeper insights into the protein engineering process. HbaR and Hmox1 were chosen as our candidate proteins to be engineered to bind to our ligand of interest, furfural. These proteins were used to test the pipeline and gave us relevant information about protein stability, protein-ligand interaction, and possible mutations to increase the binding affinity. Ultimately, they serve as a sensing domain in the final synthetic transcription factor.


The pipeline comprises four main stages: 3D structure prediction, binding sites prediction, protein mutations and ligand dockings, and NAMD molecular dynamic simulation. The entire pipeline is visible in figure 1.

Figure 1. In-silico protein design pipeline

For 3D structure prediction, a state-of-the-art AlphaFold2 model was used. If the protein structure has already been experimentally determined this step can be skipped. In the next step, a machine learning based method, P2Rank, using random forest algorithm was used to predict all possible binding pockets of a protein. This would also include DNA binding pockets which is unwanted for the purpose of this project, so we created a filter to only include the most probable ligand binding pocket and extract its coordinates from the result. Once again, this step can be skipped if the coordinates of the ligand binding site is known from experimental data.

In the next step, protein mutations and ligand dockings are performed with RosettaLigand in which the coordinate information will be used as the initial search point. A detailed description of the information needed to succeed in this step can be found in the protocol section (link). As a substep, the outputs from RosettaLigand should be filtered and analyzed before moving to the final step which is NAMD molecular dynamic (MD) simulation. Like all other MD simulation methods, NAMD is very time consuming and computationally intensive, therefore, only the most probable mutation and docking model with a high chance of success should be chosen as an input. There are three ways to filter the quality of RosettaLigand’s outputs: violin plot, PyMol visualization, and superimposition. With violin plot, relevant energy terms can be compared between wild-type and mutant model. Qualitatively, PyMol visualization (combined with Ramachandran plot) and superimposition can be used to detect unusual interactions and side chain rotations and structure breaking mutations. Finally, you can run the MD simulation and observe how the protein behaves and interacts with the ligand under a solvent condition. NAMD plots provide you with information about different energy terms that change over time.


AlphaFold2 allows work in the SynBio sphere to far exceed what was previously thought to be state-of-the-art with regards to protein design and development, particularly with regards to the age-old computational problem of predicting protein structures without the need for performing resource intensive experimentation.To simplify, AlphaFold2 treats amino sequences like language sentences. Instead of grammar that dictates the sentence structure, highly conserved domains taken from protein databases on the internet will dictate how proteins will fold. The full paper of AlphaFold2 can be accessed here.


P2Rank predicts all possible binding sites on a protein using a machine learning algorithm called random forest. The output gives a nice visualization on PyMol to gain deeper insights on possible ligand interacting residues at the binding pockets. The web-based version of the tool can be accessed here and you can read more about its paper here.


Rosetta is an academic software suite that contains many tools for different purposes. In this project, we are interested in the RosettaLigand tool for ligand docking and protein design. All the steps to design our proteins (HbaR and Hmox1) follow exactly a protocol developed by (Moretti et al. (2016)). The illustration of the pipeline can be seen on its paper. Rosetta simply uses Monte Carlo minimization with the Metropolis criterion (MCMC) method to choose the best docking and design. The quality of the result is greatly influenced by the analysis parameters and constraints. To increase the chance of finding the optimal result, it is important to have a high number of docking predictions.

The input for Rosetta is the predicted structure from AlphaFold2 and the predicted binding sites from P2Rank. The result would be more reliable if the crystal structure of the protein and experimentally determined ligand binding site are used.

To interpret the result, we followed a Rosetta scores guide that was hosted on iGEM. The guide helped us to clearly understand the meaning of each parameter. You can find the document here.


NAMD stands for Nanoscale Molecular Dynamics, was released in 2005 and remains one of the best tools available for calculating molecular dynamics simulations. However, when it comes to anisotropic systems such as lipid bilayers, NAMD performs worse compared to other tools such as GROMACS. For the purpose of this project, this problem can be ignored. The ligand-bound protein structure that was generated from the previous steps will be the input for NAMD. More steps would be needed to get all the necessary parameters and files for the NAMD to run. Fortunately, we could use a tool called “solution builder” that is hosted on CHARMM-GUI by Lehigh university to do this for us. Moreover, the tool generates a Unix script that contains all the commands to run the simulation for 10 ns. Users can edit the script to run for longer. Ideally, the simulation should run as long as possible. In our project, the simulation was run for only 20 ns due to computational resource limitations.

NAMD simulation takes into account every atom's potential energy and calculates their position in the following time-step. The equation to calculate the total potential energy of an atom can be seen below. The full details on how to calculate each potential energy can be read from the NAMD paper. The solution condition is assumed to be at constant temperature and pressure (known as NPT simulation). \[U_{total} = U_{bond} + U_{angle} +U_{dihedral} +U_{vdW} +U_{coulomb}\]


  • Jumper, J., et al. Highly accurate protein structure prediction with AlphaFold. Nature 596 (2021) p. 583-589
  • Moretti, Rocco et al. Rosetta and the Design of Ligand Binding Sites. Methods in molecular biology (Clifton, N.J.) 1414 (2016) p. 47-62
  • Phillips J. C. et al. Scalable Molecular Dynamics with NAMD Journal of computational chemistry 26,16 (2005) p. 1781-1802
  • Preetha Santhakumar et al. Molecular docking analysis of furfural and isoginkgetin with heme oxygenase I and PPARγ Bioinformation. 17,2 (2021) p. 356–362
  • rivák, R., Hoksza, D. P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure. J Cheminform 10, 39 (2018). https://doi.org/10.1186/s13321-018-0285-8



In (the paper about sBAD), the authors have developed a synthetic transcription factor to respond to benzoic acid and similar ligands by using HbaR as the sensing domain. Figure 2 shows the chemical structure of pHBA (benzoic acid) and our ligand of interest, furfural. They share a similar structure which motivated us to choose HbaR as a candidate protein to be subjected to rational modifications.

Figure 2. Chemical structure of benzoic acid (left) and furfural (right)

3D structure prediction - Alphafold2

Figure 3 shows the theoretical 3D structure of the HbaR sensing domain with the DNA binding domain (LexA) and activation domain (B112). The nuclear localization sequence (NLS) SV40 was engineered into the protein to transport the protein into the nucleus. Qualitatively, the model structure showed the protein folded well with distinct domains as annotated by us based on the protein sequence. In figure 4, we can see more details about the length and location of each domain. The regions that were associated with high predicted local distance difference test (plDDT) indicated a high model confidence while the regions with low plDDT could mean AlphaFold2 could not predict the regions well due to them naturally existing as a loop. These regions can be links between domains or signal peptides depending on their location. With this information, we can see from figue 4 that the protein is divided into three main domains. Interestingly, the first domain LexA was divided into two subdomains. The structure could be further investigated by interacting with figure 5.

Figure 3. AlphaFold2 model 3D structure of HbaR with other domains
Figure 4. plDDT at each residue position for 5 different models
Figure 5. Chemical structure of benzoic acid (left) and furfural (right)

Binding sites prediction

HbaR binding sites prediction could be done in two ways: machine learning based method (P2Rank) or homologous crystal structure (protein kinase A receptor, 5K8S).


P2Rank showed us there were eight binding sites in the protein but there were only two that were a part of the HbaR sensing domain as seen in figure 6. In table 1, pocket 5 had a higher probability to be a ligand binding site so its coordinates were used for ligand docking with Rosetta.

Figure 6. Binding sites on HbaR sensing domain colored in red and pink
Name Rank Score Probablity
pocket1 1 17.02 0.793
pocket2 2 7.37 0.386
pocket3 3 3.17 0.108
pocket4 4 2.5 0.069
pocket5 5 2.32 0.06
pocket6 6 1.45 0.019
pocket7 7 1.18 0.01
pocket8 8 0.94 0.005
Table 1. P2Rank result with corresponding statistic for each binding pocket

Homologous crystal structure (PDB ID: 5k8s)

Based on homology modeling, protein kinase A receptor (pkAr) was shown to have a similar secondary structure as HbaR. It was possible that AlphaFold2 also used an experimentally determined structure of pkAr to model HbaR. We investigated the similarity of the two structures by aligning the HbaR sensing domain of the AlphaFold2 model to a crystal structure of pkAr. The RMSD result was 1.334. A RMSD value of 0 means the two structures are perfectly aligned with each other, so the result shows us that the pkAr crystal structure is very similar to the Alphafold2 model structure of the Hbar sensing domain. From this result, we morphed the Alphafold2 model of HbaR to the cAMP-bound pkAr structure (figure 7) and replaced the cAMP with furfural. Interestingly, the binding site of pkAr had the same location as the predicted pocket 5 we found using the P2Rank method. This result was used as the input for ligand docking with Rosetta.

Figure 7. Green: Crystal structure of protein kinase A binding to cAMP (shown as sticks). Red: AlphaFold2 model with HbaR sensing domain in view (RMSD = 1.334)

Protein mutations and ligand dockings

The list of selected mutations can be found here. The 8 lowest binding energy designs for each binding site prediction method were chosen. The first eight designs belong to the homologous crystal structure method. The last eight design belong to the P2Rank method.

PyMol visualization


The docking prediction with the lowest binding energy out of 1000 possible predictions was chosen to be analyzed. Figure 8 shows the residues in the binding pocket that interact with furfural within 4 Ångström as sticks. It can be observed that residue H320 forms two hydrogen bonds with furfural at both oxygen molecules. This interaction shows the residue could be critally for the binding of furfural to HbaR. The interaction can be further investigated by interacting with figure 9.

Figure 8. Residues that interact with furfural. Each stick represents a residue that is predicted to interact with furfural
Figure 9. Residues that interact with furfural. Each stick represents a residue that is predicted to interact with furfural (Interactive)
Homologous crystal structure (PDB ID: 5k8s)

Due to the morphing of the Alphafold2 model of HbaR to fit the pkAr structure, only the parts of the Alphafold2 model that were matched with pkAr were retained and designed to dock furfural. Consequently, this affected the residues that interact with furfural as seen in figure 10, as only aliphatic residues are observed. The interaction can be further investigated by interacting with figure 11.

Figure 10. Residues (represented as sticks) that interact with furfural
Figure 11. Residues (represented as sticks) that interact with furfural (Interactive)


Superimposition was done to predict and observe how the mutations affect the protein folding or function. It is important to know that all the models were generated by AlphaFold2 in this step. From figure 12, the results clearly showed that the P2Rank method to design the protein had less effect on the protein structure than using the pkAr crystal structure method. Low RMSD value means the protein structure was not greatly altered by the mutations.

Figure 12. The left image shows the P2Rank method and the right image shows the homologous crystal structure method. RMSD = 1.377 (P2Rank) and 13.838 (homologous crystal structure). Before (green) and after (red) mutations

Violin plots

Rosetta managed to design the HbaR binding site to have lower binding energy to furfural for both methods (figure 13). There were more predictions based on the P2rank method that had lower binding energy than the crystal structure based method. However, some predictions with the crystal structure based method had lower binding energy than what the P2Rank based method could achieve.

Figure 13. Binding energy of furfural to the binding pocket of both protein forms for each binding site prediction method

It was not enough just to look at the binding energy to determine the chance of successful ligand binding. The total energy was also important as it described the protein stability. Interestingly, figure 14 showed the crystal structure method to predict the HbaR binding site made the whole protein very unstable for both the mutated and non-mutated form. This could be due to the morphing of the Alphafold2 model to the crystal structure introducing more unfavorable rotamers to the protein residues.

Figure 14. Total energy of both protein forms for each binding site prediction method

NAMD simulation

In the final step, we observed the effect of the mutations through NPT simulation. Furfural was initially placed at the binding site of the HbaR sensing domain and the simulation was run for 20 nanoseconds.

Figure 15 shows how the non-mutated form of the protein interacts with furfural. Unsurprisingly, the ligand did not stay in the binding pocket and undocked quickly.

Figure 15. Simulation of furfural docking to the non-mutated HbaR binding domain


With the P2Rank method, we were fortunate to get a design that behaves as we expected in the simulation (figure 16). The ligand initially interacts with pocket 5 as predicted with the P2Rank tool and then jumps to pocket 6 after a while and stays there. You can observe how the alpha helix secondary structure changes to fit to the ligand after 150 ps.

Figure 16. Simulation of furfural docking to the mutated HbaR binding domain for P2Rank binding site prediction method

Homologous crystal structure (PDB ID: 5k8s)

With the homologous crystal structure method, the ligand did not stay in the binding pocket (figure 17). However, it stayed in the pocket for longer than for the non-mutated form of the protein.

Figure 17. Simulation of furfural docking to the mutated HbaR binding domain for homologous-crystal-structure binding site prediction method

NAMD plots

All the relevant energy terms of the simulation at each timestep could be extracted and plotted as NAMD plots (figure 18). The simulation for the crystal structure method had the highest total energy. This meant, based on the simulation, the mutations made using the crystal structure method made the protein more unstable than the non-mutated form. The total energy between the non-mutated and mutated form of HbaR using the P2Rank method was approximately the same.



Figure 18. Namd plots of total energy for each protein form from each binding site prediction method



Hmox1 was chosen to be a candidate protein because in a paper by Santhakumar et al. (2021), the protein was computationally shown to have a high binding affinity to furfural. Hmox1 natively binds to heme, as seen in an experimentally determined structure (1N3U). We replaced the heme ligand with furfural and performed docking using Rosetta.

3D structure prediction - Alphafold2

For the modeling of Hmox1, a VP16 activation domain was used instead of the activation domain B112 to observe the predicted structure of the VP16 activation domain. All domains except for the activation domain were folded properly in Alphafold2 (figreu 19). This resulted in a long loop being created at the end of the protein that also included a part of the Hmox1 sequence. Nevertheless, the Hmox1 sensing domain was predicted with good confidence and could thus be used for docking in Rosetta. The structure can be further investigated by interacting with figure 20.

Figure 19. Model 3D structure of Hmox1 with other domains
Figure 20. Model 3D structure of Hmox1 with other domains (Interactive)

The disorder structure of the activation domain can also be seen in figure 21. Compared to figure 4, the transactivation domain at the end of the protein sequence seemed to be lost with this model predictions.

Figure 21. plDDT at each residue position

Ligand dockings

PyMol visualziation

The aldehyde group of furfural interacts with Hmox1 the same way as the carboxylic acid group of heme does (figure 22). Since heme has two carboxylic acid groups, furfural could potentially have two locations to interact with Hmox1. However, furfural is much smaller than heme, so this would likely affect the protein function. The docked-3D structure of both ligands can be interacted in figure 23 and 24.

Figure 22. Furfural and Heme binding to Hmox1
Figure 23. Furfural binding to Hmox1 (Interactive)
Figure 24. Heme binding to Hmox1 (Interactive)

Violin plots

Furfural showed a higher binding energy than heme did, as we expected (Figure 25). The difference was between -6 REU to -8 REU with a p-value of \(2.2\times10^-16\).

Figure 25. Binding energy of Furfural and Heme to Hmox1

There was a small but significant (p-value: \(2.2\times10^-16\)) difference between the total energy of both ligands (Figure 26). Moreover, Hmox1 seemed to be more sensitive to heme conformation, as both of its energy term results had a larger value range than furfural. A small change in heme placement at the binding pocket could greatly affect Hmox1 stability. This could be due to heme being larger than furfural and therefore interacting more with Hmox1.

Figure 26. Total energy of Hmox1 when binds to Furufral or Heme


With a completed pipeline setup and successful simulations, the results of the pipeline were presented to the wet-lab team to start molecular engineering. It was important to keep in mind that the results for each step were hypothetical and therefore it was hard to determine the success rate of subsequent molecular engineering. Nevertheless, the presented pipeline was an effective starting point for a rational protein design that could potentially save a lot of time in the wet-lab when looking to optimize ligand binding of transcription factors.

From the result, for the HbaR protein, the P2Rank method was shown to have a higher chance of being successful. The NAMD simulation also showed that binding pocket 6 should be engineered to fit a new ligand instead of pocket 5. And for Hmox1, no mutations were needed as the energy terms showed both ligands interacting with the protein in almost the same way.

In-silico protein design pipeline

Drylab notebook