Since the ideal solution to our project would be to re-engineer a germinant receptor of B. subtilis, in vitro and in vivo methodologies such as directed evolution and rational design would have been the go-to approaches for altering ligand specificity. However, doubts about how robustly such methodologies could re-engineer sparsely researched germinant receptors within the time constraints of iGEM led us to take a different approach to our problem statement. Nonetheless, the prospect of re-engineering the receptors remained very much on our minds, and so we looked to in silico approaches. Although many tools exist, many fail to deliver particularly useful results or are closed source. We therefore present InFinity 1.0, an open-source framework for re-engineering proteins, specifically to alter and/or improve ligand specificity and affinity. This is achieved through high-throughput combinatorial mutagenesis of specified residues of a protein, followed by evaluation of the mutations' effect on ligand affinity using molecular docking. Ranking of the mutants and multiple sequence alignment then provide information about common motifs, which can be used to narrow down which mutations are more likely to improve ligand affinity. The framework thus serves as an initial in silico tool to help guide rational design and screening efforts.
What sets our tool apart from current ones is two-fold: 1. It implements the recently released machine learning-based docking tool EquiBind (Stärk et al., 2022) and the scoring function ΔLin_F9XGB (Yang and Zhang, 2022), which, along with massive parallelisation on high-performance computing clusters, allow us to sample a greater combinatorial space more efficiently and accurately than previously possible with open-source tools. 2. It is an open-source framework in which the individual tools in each of the five steps are interchangeable, so that the accuracy and efficiency of the overall pipeline can be continuously improved by adopting newer tools as they become available.
Figure 1: The InFinity 1.0 pipeline, highlighting the overarching steps and their individual methodologies.
1. Combinatorial mutagenesis
Using a simple .txt file, the user inputs a wild-type protein sequence in which residues can be selected for random mutation. As there are on the order of 20^n possible mutants, where n is the number of selected residues, the complete combinatorial space can be randomly sampled down to a subset, as limited by the storage and compute resources available to the user. In the pipeline's current implementation, for an average-size protein, we advise a limit of 1,000,000 mutants per TB of storage. With residues selected and a reasonable limit set, the algorithm generates random combinatorial mutant sequences to be used in subsequent steps.
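The sampling described above can be sketched in a few lines of Python. This is a minimal illustration and not the InFinity 1.0 implementation: the function names, the 0-based position indices and the use of rejection sampling to keep mutants unique are all assumptions made for the example.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def combinatorial_space(n_positions):
    """Number of sequences reachable by varying n_positions residues (20^n)."""
    return len(AMINO_ACIDS) ** n_positions


def sample_mutants(wildtype, positions, limit, seed=None):
    """Return up to `limit` unique random mutants of `wildtype`.

    `positions` are 0-based indices of the residues selected for mutation
    (a hypothetical convention chosen for this sketch).
    """
    rng = random.Random(seed)
    positions = sorted(set(positions))  # ignore accidental duplicates
    # Never ask for more mutants than exist; the wild type itself is excluded.
    target = min(limit, combinatorial_space(len(positions)) - 1)
    mutants = set()
    while len(mutants) < target:
        seq = list(wildtype)
        for pos in positions:
            seq[pos] = rng.choice(AMINO_ACIDS)
        candidate = "".join(seq)
        if candidate != wildtype:
            mutants.add(candidate)  # the set silently drops repeats
    return sorted(mutants)
```

For example, `sample_mutants("MKTAY", [1, 3], limit=50, seed=1)` draws 50 distinct sequences from the 400 possible combinations at the two selected positions while leaving the other residues untouched.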
2. Structural mutagenesis
Using the generated sequences from step 1 and a supplied wild-type .pdb file, the mutations are introduced structurally using the PyMOL (Schrödinger and DeLano, 2020) mutagenesis tool, which substitutes residue side chains and selects the most favourable rotamer for each. As this step does not, in its current form, account for changes to the backbone or to protein folding, it cannot be relied on to accurately model major structural changes. The tool is therefore best suited to a small number of structurally minor mutations, i.e. those we may expect when trying to alter protein specificity or affinity only slightly. In future releases, this step will include energy minimisation using an adapted version of AlphaFold (Jumper et al., 2021).
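Before any structure can be edited, each mutant sequence has to be translated into a list of point substitutions. The sketch below shows one way this bookkeeping could be done. It is an assumption about how such a step might look, not the pipeline's own code: the helper name and the `first_residue_number` offset are hypothetical, and the PyMOL calls that would consume the result are deliberately left out.

```python
ONE_TO_THREE = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}


def substitutions(wildtype, mutant, first_residue_number=1):
    """List (residue_number, three_letter_code) for every changed position.

    `first_residue_number` is the number of the first sequence residue in
    the .pdb file (hypothetical parameter; structures often do not start at 1).
    """
    if len(wildtype) != len(mutant):
        raise ValueError("sequences must have equal length (substitutions only)")
    return [
        (first_residue_number + i, ONE_TO_THREE[new])
        for i, (old, new) in enumerate(zip(wildtype, mutant))
        if old != new
    ]
```

Calling `substitutions("GAV", "GDV", first_residue_number=747)` returns `[(748, "ASP")]`, i.e. a residue number and a three-letter target residue that a structural mutagenesis tool could then apply one substitution at a time.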
3. High-throughput docking
An adapted version of the deep learning-based docking tool EquiBind takes the .pdb files from step 2 and a .mol2 file of the target ligand and performs high-throughput rigid docking, generating a predicted binding pose of the ligand for each of the mutants. EquiBind’s use of an SE(3)-equivariant geometric deep learning model significantly improves the computational efficiency of docking and has been shown to produce more accurate binding poses than comparable baselines (Stärk et al., 2022). Thanks to GPU acceleration, a single GPU is sufficient to evaluate ~200,000 mutants per day, and by splitting the mutant dataset, this step can be parallelised across more GPUs.
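Splitting the mutant dataset across GPUs, and estimating how long docking will take, amounts to simple arithmetic. The following sketch assumes the ~200,000 mutants per GPU per day quoted above and ideal linear scaling. The function names are hypothetical, and how each chunk is handed to a GPU job is not shown.

```python
def split_into_chunks(items, n_chunks):
    """Split `items` into n_chunks contiguous chunks whose sizes differ by at most one."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)  # first `extra` chunks get one more
        chunks.append(items[start:start + size])
        start += size
    return chunks


def estimated_days(n_mutants, n_gpus, mutants_per_gpu_day=200_000):
    """Rough docking time in days, assuming throughput scales linearly with GPUs."""
    return n_mutants / (n_gpus * mutants_per_gpu_day)
```

Under these assumptions, 1,000,000 mutants would take about 5 days on one GPU and about 1.25 days on four.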
4. Affinity scoring
The ΔLin_F9XGB scoring function (Yang and Zhang, 2022) is an improved version of the linear empirical scoring function Lin_F9 that utilises extreme gradient boosting and Δ-machine learning. ΔLin_F9XGB is on par with or superior to some of the leading scoring functions when tested against the CASF-2016 benchmark (Yang and Zhang, 2022). As ΔLin_F9XGB is open source, it was a prime candidate for affinity scoring, and we adapted it to work in our framework. Taking the .mol2 binding poses generated in step 3, the ΔLin_F9XGB scoring function can be run in a massively parallel fashion, limited by the number of CPU cores available to the user.
5. Multiple sequence alignment
Taking the top-ranking mutants from step 4, a multiple sequence alignment (MSA) can be carried out. As the useful information that can be extracted from an MSA depends on the engineered protein, the specific MSA tool used for this step is the user's prerogative. Nonetheless, Clustal Omega (McWilliam et al., 2013) is recommended as a general first-line tool and can be used to look for recurrent substitutions among top-scoring mutant proteins. The end goal of this step is to identify trends in the types of residue substitutions seen in the top-scoring mutants. This can aid future rational design and help narrow down the mutants to be tested in the laboratory.
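Because the mutants generated in step 1 contain substitutions only and therefore all have the same length as the wild type, recurrent substitutions can also be tallied position by position as a quick first look. The sketch below is a simplified stand-in for a full MSA and not a replacement for it. The function names are hypothetical and the scores in the usage note are invented for illustration.

```python
from collections import Counter


def top_mutants(scores, n):
    """Return the n sequences with the highest predicted pKd.

    `scores` maps each mutant sequence to its predicted pKd.
    """
    return sorted(scores, key=scores.get, reverse=True)[:n]


def recurrent_substitutions(wildtype, mutants):
    """Count how often each substitution (e.g. 'A2D') occurs among `mutants`."""
    counts = Counter()
    for mutant in mutants:
        for number, (old, new) in enumerate(zip(wildtype, mutant), start=1):
            if old != new:
                counts[f"{old}{number}{new}"] += 1
    return counts
```

With the made-up scores `{"GDV": 7.2, "GDL": 6.9, "GAL": 5.1, "GEV": 4.8}` and wild type `"GAV"`, the two top mutants are `GDV` and `GDL`, and the tally shows that `A2D` occurs in both while `V3L` occurs once.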
Figure 2. Predicted pKd affinity values of CHEMBL1193101/androgen receptor (3B5R) complexes, obtained using the InFinity 1.0 pipeline and the PSnpBind (Ammar et al., 2022) methodology.
While the methodologies differ in absolute affinity values, they show a similar trend: most mutations do not cause major changes to binding affinity, except for A748D, which improves binding substantially according to the InFinity 1.0 pipeline and less so according to PSnpBind.
Figure 2 shows a clear trend in which most of the surveyed mutations have little effect on binding affinity, with the exception of A748D, which causes a large change according to InFinity 1.0 and a smaller, but still noticeable, change according to PSnpBind (Ammar et al., 2022). This exemplifies that InFinity 1.0, like PSnpBind, can differentiate between the effects of in silico mutations. To further analyse InFinity 1.0’s robustness, a standardised benchmark of the combined EquiBind and ΔLin_F9XGB part of the pipeline was performed to compare its accuracy against current tools. This benchmark was performed using the 19,119 protein-ligand complexes from the PDBbind 2020 database (Liu et al., 2017). Binding poses were generated with EquiBind and subsequently scored using ΔLin_F9XGB. The correlation coefficient between predicted and experimental binding affinities was calculated and plotted, as shown in figure 3.
Figure 3. Scatter plot of experimental pKd against predicted pKd for protein-ligand complexes docked with EquiBind and scored by ΔLin_F9XGB. The red line shows the correlation between experimental and predicted pKd.
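For reference, the Pearson correlation coefficient used in this kind of benchmark can be computed as in the minimal sketch below. It illustrates the statistic only. The function name is hypothetical, and the actual benchmark may well have relied on a statistics library instead.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

Here `xs` would hold the experimental pKd values and `ys` the predicted ones. A value of 1 indicates a perfect positive linear relationship and 0 indicates none.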
As seen in figure 3, the combination of EquiBind and ΔLin_F9XGB results in a Pearson’s correlation coefficient of 0.32 when compared with experimentally determined affinities. While low compared to CASF-2016 rescoring benchmarks, this is comparable to that of established tools such as X-score, FlexX, AutoDock and BLEEP, while being much faster (Kim and Skolnick, 2008). Future improvement may come from utilising the flexible docking option of EquiBind and/or an alternative scoring function. The differences seen in figure 2 are likely owed in part to intrinsic differences in the scoring functions used, but also to the PSnpBind database’s use of energy minimisation after mutagenesis, which is yet to be implemented in the InFinity 1.0 pipeline. Nonetheless, the results indicate that InFinity 1.0 can indeed be used to draw generalised conclusions about which mutations are likely to alter ligand specificity, and can thus be used to narrow down screening and/or rational design efforts.
Although the framework uses specific tools to perform each of the steps, each step is interchangeable, so the tool used at each step can be substituted as the field improves. In the pipeline's current implementation, the results cannot be transferred directly to the lab in the form of a direct screen of the top mutants, yet useful information about trends can help guide laboratory efforts, potentially lowering the time needed and the laboratory-associated costs. Future improvements in the reliability of the individual steps may, however, shift the balance toward a more direct use of the in silico-screened mutants. In the immediate future, the implementation of accelerated energy minimisation from AlphaFold (Jumper et al., 2021) in step 2 will improve the accuracy of the mutant screen.
[1] Ammar, A., Cavill, R., Evelo, C. et al., 2022, ‘PSnpBind: a database of mutated binding site protein–ligand complexes constructed using a multithreaded virtual screening workflow’, J Cheminform, 14, 8. Available at: https://doi.org/10.1186/s13321-021-00573-5
[2] Bohl CE, Wu Z, Chen J, Mohler ML, Yang J, Hwang DJ, Mustafa S, Miller DD, Bell CE, Dalton JT., 2008, ‘Effect of B-ring substitution pattern on binding mode of propionamide selective androgen receptor modulators’, Bioorg Med Chem Lett., 18(20):5567-70. Available at: https://doi.org/10.1016/j.bmcl.2008.09.002
[3] Jumper, J., Evans, R., Pritzel, A. et al., 2021, ‘Highly accurate protein structure prediction with AlphaFold’, Nature, 596, 583–589. Available at: https://doi.org/10.1038/s41586-021-03819-2
[4] Kim R, Skolnick J., 2008, ‘Assessment of programs for ligand binding affinity prediction’, J Comput Chem., 29(8):1316-31. Available at: https://doi.org/10.1002/jcc.20893
[5] Liu, Zhihai; Su, Minyi; Han, Li; Liu, Jie; Yang, Qifan; Li, Yan; Wang, Renxiao., 2017, ‘Forging the Basis for Developing Protein-Ligand Interaction Scoring Functions’, Accounts of Chemical Research, 50 (2): pp. 302-309. Available at: http://www.pdbbind.org.cn/index.php
[6] McWilliam H, Li W, Uludag M, Squizzato S, Park YM, Buso N, Cowley AP, Lopez R., 2013, ‘Analysis Tool Web Services from the EMBL-EBI’, Nucleic Acids Research, 41(Web Server issue):W597-600. Available at: https://doi.org/10.1093/nar/gkt376
[7] Schrödinger, L. & DeLano, W., 2020. PyMOL, Available at: http://www.pymol.org/pymol.
[8] Stärk, H. et al., 2022, ‘EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction’. Available at: https://doi.org/10.48550/arXiv.2202.05146.
[9] Yang, C. and Zhang, Y., 2022, ‘Delta Machine Learning to Improve Scoring-Ranking-Screening Performances of Protein–Ligand Scoring Functions’, Journal of Chemical Information and Modeling, 62(11), pp. 2696–2712. Available at: https://doi.org/10.1021/acs.jcim.2c00485