Background

Protein design is essential for any protein engineering project. While designing new proteins is still challenging, improving existing ones by site-directed mutagenesis is feasible. Because the space of possible mutations is huge, an in silico preselection of mutants is needed. Several tools are available that predict the effect of mutations on protein structure. However, they are complicated to use and do not suggest which mutants to proceed with in a particular experiment [1][2].

To address this problem, we developed the software ProMutor.

ProMutor is a flexible, easy-to-use, open-source platform for generating advantageous point mutants. In our software, wild-type sequences are used to suggest a list of potent mutations based on explicit modelling of evolutionary history. It enables the user to pick the variants of interest based on epistatic scores. To our knowledge, it is the first web-based software to tackle the complexity of protein design.

Description

We developed a user-friendly, web-based tool to facilitate protein design. As a starting point, the tool takes either a nucleotide or an amino acid sequence. Multiple operation modes are offered to the user so that mutant creation can be customised as far as possible. Based on the chosen protein design settings, ProMutor provides a list of mutants together with the predicted effect of the introduced mutations, derived from epistatic effects and amino acid conservation across species.
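When a nucleotide sequence is supplied, it has to be converted to amino acids before any substitution can be proposed. The following is a minimal sketch of such a conversion, assuming the standard genetic code and translation in the first reading frame up to the first stop codon. The function name `translate` is hypothetical, and ProMutor's own handling of DNA input may differ.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate coding DNA (frame 1) until the first stop codon.

    Assumption: frame 1 and the standard code; not taken from ProMutor itself.
    """
    dna = "".join(dna.split()).upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # "X" marks an ambiguous codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCTGGTTTTAA"))  # first residues of the tutorial protein
```

Trailing bases that do not form a complete codon are ignored in this sketch.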

The mutant generation pipeline begins with a wild-type sequence and a list of positions to be mutated. For instance, this could be a known epitope that is to be modified, post-translational modification sites that should be removed, etc. In addition, our tool could be used as a screening instrument for the effect of all possible single-point mutations. This would facilitate the screening for amino acids that are essential for the protein’s structure and function.
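To make the size of a full single-point screen concrete, the sketch below enumerates every substitution of a short sequence. It only illustrates the combinatorics (19 substitutions per position). The helper name `single_point_mutants` and the four-residue toy input are assumptions, not part of ProMutor.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def single_point_mutants(sequence):
    """Yield (name, mutant_sequence) for every single substitution.

    Names follow the usual <wild type><1-based position><substitute> form.
    """
    for index, wild_type in enumerate(sequence):
        for substitute in AMINO_ACIDS:
            if substitute == wild_type:
                continue  # skip the wild-type residue itself
            name = f"{wild_type}{index + 1}{substitute}"
            yield name, sequence[:index] + substitute + sequence[index + 1:]


mutants = dict(single_point_mutants("MAWF"))
print(len(mutants))      # 19 substitutions at each of the 4 positions
print(mutants["A2V"])
```

For a protein of n standard residues this yields 19 × n variants, which is why a fast scoring method matters for screening.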


Figure. Schematic representation of the ProMutor pipeline

Two modes enable different types of mutant generation. Once a mutation list is submitted, variants are designed whose structure is either similar to the WT structure (“Preserved” mode) or significantly different from it (“Disrupted” mode). In both cases, substitution matrices such as BLOSUM or PAM [3] are used to pre-select amino acids. Eight matrices are offered so that the prediction is as tunable as possible. When the structure is to be conserved, the most similar amino acids are chosen as substitutes, since evolutionarily tolerated substitutions are assumed to cause no major perturbation of the protein's structure or function. Conversely, when the protein structure should be disrupted, the substitutions that are least tolerated in evolution are selected.
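The matrix-based pre-selection can be sketched as a simple ranking of substitutes by their substitution score. The example below hard-codes three rows of BLOSUM62 (for V, I and M), transcribed from the standard table. The function name `rank_substitutes` and the way ties are ordered are assumptions, and ProMutor's actual selection logic may differ.

```python
# Column order of the standard BLOSUM62 table.
ORDER = "ARNDCQEGHILKMFPSTWYV"

# Excerpt of BLOSUM62 (rows for V, I and M only); check against a reference
# copy of the matrix before relying on these values.
BLOSUM62_ROWS = {
    "V": [0, -3, -3, -3, -1, -2, -2, -3, -3, 3, 1, -2, 1, -1, -2, -2, 0, -3, -1, 4],
    "I": [-1, -3, -3, -3, -1, -3, -3, -4, -3, 4, 2, -3, 1, 0, -3, -2, -1, -3, -1, 3],
    "M": [-1, -1, -2, -3, -1, 0, -2, -3, -2, 1, 2, -1, 5, 0, -2, -1, -1, -1, -1, 1],
}


def rank_substitutes(wild_type, preserved=True):
    """Rank candidate substitutes for one residue.

    preserved=True  -> most similar amino acids first (highest score)
    preserved=False -> least tolerated amino acids first (lowest score)
    """
    scores = dict(zip(ORDER, BLOSUM62_ROWS[wild_type]))
    del scores[wild_type]  # a residue is not a substitute for itself
    return sorted(scores, key=lambda aa: scores[aa], reverse=preserved)


for residue in "VIM":
    print(residue, rank_substitutes(residue)[0], rank_substitutes(residue, preserved=False)[0])
```

Under these assumptions the top "Preserved" candidates for V, I and M are I, V and L, which is consistent with the mutant names V28I, I57V and M97L shown in the tutorial output.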

After compiling a list of mutants for testing, ProMutor evaluates the proposed substitutions by predicting their mutational outcomes. For this purpose it uses GEMME [4][5], which evaluates mutations based on explicit modelling of the evolutionary history of natural sequences. GEMME is also much faster than other existing tools: it takes into account only biologically meaningful and interpretable parameters and does not require any complex learning steps. Given an input alignment, it generates a complete landscape of protein mutations in minutes. Extensive validation has shown that GEMME is on par with or better than more complex tools. Moreover, GEMME can handle combinations of mutations and give reliable predictions for them.

By default, HHblits [6] and the Uniclust [7] database are used to create the multiple sequence alignment (MSA) required as input for GEMME. Several databases are provided (Uniclust, PDB70 [8], PfamA [9], SCOP70 [10]), so the user is free to choose. However, we suggest the Uniclust database because of its larger search space and more reliable predictions.
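For readers who want to reproduce the alignment step locally, the sketch below assembles an HHblits command line. The options used (`-i`, `-d`, `-oa3m`, `-n`, `-cpu`) are standard HHblits options. The database path, the iteration count, the CPU count and the helper name `hhblits_command` are placeholders, not the settings ProMutor uses.

```python
import shlex


def hhblits_command(query_fasta, database_prefix, output_a3m, iterations=3, cpus=4):
    """Build (but do not run) an HHblits call that writes an A3M alignment.

    iterations and cpus are illustrative defaults, not ProMutor's settings.
    """
    return [
        "hhblits",
        "-i", query_fasta,        # query sequence in FASTA format
        "-d", database_prefix,    # database prefix, e.g. a Uniclust build
        "-oa3m", output_a3m,      # resulting MSA in A3M format
        "-n", str(iterations),    # number of search iterations
        "-cpu", str(cpus),        # number of CPU threads
    ]


# Hypothetical paths, shown only to illustrate the call.
command = hhblits_command("query.fasta", "/path/to/uniclust/db_prefix", "query.a3m")
print(shlex.join(command))
```

The list form can be passed to `subprocess.run(command, check=True)` on a machine where HHblits and the chosen database are installed.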

Once all steps are complete, the user gets a list of scored point mutants as well as their sequences. To enable reproducibility, all inputs and outputs are interpreted and reported. The results can be displayed and downloaded in several formats. Finally, protein structure prediction with ColabFold [11][12] is offered as a follow-up step.

Overall, the pipeline is a multi-step process that is fully automated and requires no user involvement after submission. Calculations are performed on a multi-core cluster and are accessible through a single web form.

Note from authors: “Our vision is to provide a complete and powerful tool that makes protein design as painless as possible.”

Tutorials

This tutorial explains how to set every parameter in order to use both available modes efficiently (screening and mutant generation). For more detailed explanations, please visit the GitLab repository.

Screening mode

Input

Step 1: Choose your sequence type

Is it a DNA (nucleotide bases) or Protein (amino acids) sequence?

protein

Step 2: Insert your Sequence

You can choose to upload a fasta file (.fasta, .fa, .fna, .fnn, .faa) or paste the sequence as follows:

>sp|P27352|IF_HUMAN Cobalamin binding intrinsic factor OS=Homo sapiens OX=9606 GN=CBLIF PE=1 SV=2
MAWFALYLLSLLWATAGTSTQTQSSCSVPSAQEPLVNGIQVLMENSVTSSAYPNPSILIA MNLAGAYNLKAQKLLTYQLMSSDNNDLTIGQLGLTIMALTSSCRDPGDKVSILQRQMENW APSSPNAEASAFYGPSLAILALCQKNSEATLPIAVRFAKTLLANSSPFNVDTGAMATLAL TCMYNKIPVGSEEGYRSLFGQVLKDIVEKISMKIKDNGIIGDIYSTGLAMQALSVTPEPS KKEWNCKKTTDMILNEIKQGKFHNPMSIAQILPSLKGKTYLDVPQVTCSPDHEVQPTLPS NPGPGPTSASNITVIYTINNQLRGVELLFNETINVSVKSGSVLLVVLEEAQRKNPMFKFE TTMTSWGLVVSSINNIAENVNHKTYWQFLSGVTPLNEGVADYIPFNHEHITANFTQY
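Pasted sequences often contain line breaks or spaces inside the residue block, as in the example above. The sketch below shows one way to normalise such input into header and sequence pairs. The function name `parse_fasta` is an assumption, and the tool's own parser may behave differently.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text.

    Whitespace inside the sequence block is removed and residues are
    upper-cased; lines before the first header are ignored.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append("".join(line.split()).upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Shortened excerpt of the tutorial sequence, for illustration only.
pasted = ">sp|P27352|IF_HUMAN excerpt\nMAWFALYLLS LLWATAGTST\nqtqsscsvps\n"
records = parse_fasta(pasted)
print(records[0][0])
print(records[0][1])
```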

Step 3: Insert the position of mutation(s)

Set the position to 0 to enable screening mode

0

Step 4: Insert the positions that are always mutated

Leave empty

Step 5: Insert maximum number of mutation(s)

Leave empty

Step 6: Choose structure preservation mode

By default: Preserved

Step 7: Choose the similarity matrix

By default: Blosum62

Step 8: Choose the database to be used

Select which database should be used to construct an MSA. Uniclust is recommended for more reliable predictions. Faster results may be produced with other choices.

Uniclust

Step 9: Submit the form

Press: Run


Output

Selected parameters

Argument Chosen
Sequence >sp|P27352|IF_HUMAN Cobalamin binding intrinsic factor OS=Homo sapiens OX=9606 GN=CBLIF PE=1 SV=2
MAWFALYLLSLLWATAGTSTQTQSSCSVPSAQEPLVNGIQVLMENSVTSSAYPNPSILIA MNLAGAYNLKAQKLLTYQLMSSDNNDLTIGQLGLTIMALTSSCRDPGDKVSILQRQMENW APSSPNAEASAFYGPSLAILALCQKNSEATLPIAVRFAKTLLANSSPFNVDTGAMATLAL TCMYNKIPVGSEEGYRSLFGQVLKDIVEKISMKIKDNGIIGDIYSTGLAMQALSVTPEPS KKEWNCKKTTDMILNEIKQGKFHNPMSIAQILPSLKGKTYLDVPQVTCSPDHEVQPTLPS NPGPGPTSASNITVIYTINNQLRGVELLFNETINVSVKSGSVLLVVLEEAQRKNPMFKFE TTMTSWGLVVSSINNIAENVNHKTYWQFLSGVTPLNEGVADYIPFNHEHITANFTQY
Matrix Blosum62
Database uniclust
Number of mutant
Positions 0
Preserved True
Input type protein

Screening result

The results of the screening are displayed as a heatmap. Amino acid positions of the protein are shown on the x-axis and each possible substituting amino acid on the y-axis. The resulting ΔΔE scores indicate the evolutionary tolerance of every possible substitution at each position. Substitutions at highly variable positions have ΔΔE ≈ 0 (minor effect on protein structure), while more deleterious substitutions receive lower ΔΔE scores. In short, the plot depicts the effect of mutations on protein structure, with red and blue corresponding to low and high effect, respectively.


Figure. Example of the heatmap produced by the screening mode, with amino acid positions on the x-axis and the evolutionary score of every amino acid at each position on the y-axis
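A screening matrix of this kind can also be summarised numerically, for example by averaging the ΔΔE scores at each position to find the residues that tolerate substitution least. The sketch below uses made-up scores for three positions and three substitutes. The numbers, the data layout and the function names are illustrative assumptions and are not ProMutor output.

```python
# Made-up ΔΔE values (position -> substitute -> score), for illustration only.
scores = {
    1: {"A": -0.1, "L": -0.2, "V": 0.0},
    2: {"A": -2.5, "L": -3.0, "V": -2.0},
    3: {"A": -0.8, "L": -0.4, "V": -0.6},
}


def mean_score_per_position(scores):
    """Average ΔΔE over all substitutes at each position."""
    return {pos: sum(row.values()) / len(row) for pos, row in scores.items()}


def most_sensitive_positions(scores, top=1):
    """Positions with the lowest mean ΔΔE, i.e. the least tolerant ones."""
    means = mean_score_per_position(scores)
    return sorted(means, key=means.get)[:top]


print(most_sensitive_positions(scores, top=2))
```

Positions whose mean stays close to zero correspond to the variable columns of the heatmap.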

Mutant generation

Input

Step 1: Choose your sequence type

Is it a DNA (nucleotide bases) or Protein (amino acids) sequence?

protein

Step 2: Insert your Sequence

You can choose to upload a fasta file (.fasta, .fa, .fna, .fnn, .faa) or paste the sequence as follows:

>sp|P27352|IF_HUMAN Cobalamin binding intrinsic factor OS=Homo sapiens OX=9606 GN=CBLIF PE=1 SV=2
MAWFALYLLSLLWATAGTSTQTQSSCSVPSAQEPLVNGIQVLMENSVTSSAYPNPSILIA MNLAGAYNLKAQKLLTYQLMSSDNNDLTIGQLGLTIMALTSSCRDPGDKVSILQRQMENW APSSPNAEASAFYGPSLAILALCQKNSEATLPIAVRFAKTLLANSSPFNVDTGAMATLAL TCMYNKIPVGSEEGYRSLFGQVLKDIVEKISMKIKDNGIIGDIYSTGLAMQALSVTPEPS KKEWNCKKTTDMILNEIKQGKFHNPMSIAQILPSLKGKTYLDVPQVTCSPDHEVQPTLPS NPGPGPTSASNITVIYTINNQLRGVELLFNETINVSVKSGSVLLVVLEEAQRKNPMFKFE TTMTSWGLVVSSINNIAENVNHKTYWQFLSGVTPLNEGVADYIPFNHEHITANFTQY

Step 3: Insert the position of mutation(s)

You can decide at which position you want to insert mutations.

28, 57, 97
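The position field serves two purposes: a single 0 switches to screening mode, while a comma-separated list selects the sites to mutate. A minimal sketch of how such a field could be interpreted and validated is shown below. The function name `parse_positions` and the error handling are assumptions about reasonable behaviour, not a description of the web form's implementation.

```python
def parse_positions(field, sequence_length):
    """Interpret the position field of the form.

    Returns None for screening mode (a single 0), otherwise a sorted list
    of unique 1-based positions. Raises ValueError for empty or
    out-of-range input.
    """
    tokens = [token.strip() for token in field.split(",") if token.strip()]
    if not tokens:
        raise ValueError("enter at least one position, or 0 for screening mode")
    positions = sorted({int(token) for token in tokens})
    if positions == [0]:
        return None  # screening mode
    for position in positions:
        if not 1 <= position <= sequence_length:
            raise ValueError(f"position {position} is outside 1..{sequence_length}")
    return positions


print(parse_positions("28, 57, 97", 417))  # the tutorial protein has 417 residues
print(parse_positions("0", 417))
```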

Step 4: Insert the positions that are always mutated

If a position from Step 3 must be mutated in every generated sequence, specify it here. Leave empty for a combinatorial mix.

57

Step 5: Insert maximum number of mutation(s)

Based on the positions from Step 3, you can choose the maximum number of mutations to include per sequence. In this tutorial we chose positions 28, 57 and 97 (three different positions), so we can include up to 3 mutations per sequence. With the maximum number of mutations set to 3, the software generates singlets, doublets and triplets.

3
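The interaction of Steps 3 to 5 can be illustrated by enumerating the sets of positions that are mutated together. The sketch below assumes that every subset of the chosen positions up to the maximum size is considered and that subsets lacking an "always mutated" position are discarded. The function name `position_sets` is hypothetical, and the exact rules applied by ProMutor may differ.

```python
from itertools import combinations


def position_sets(positions, always_mutated=(), max_mutations=None):
    """Enumerate position combinations of size 1..max_mutations.

    Only combinations containing every 'always mutated' position are kept.
    """
    positions = sorted(set(positions))
    required = set(always_mutated)
    if not required <= set(positions):
        raise ValueError("always-mutated positions must be among the chosen positions")
    limit = len(positions) if max_mutations is None else min(max_mutations, len(positions))
    result = []
    for size in range(1, limit + 1):
        for combo in combinations(positions, size):
            if required <= set(combo):
                result.append(combo)
    return result


print(position_sets([28, 57, 97], always_mutated=[57], max_mutations=3))
print(len(position_sets([28, 57, 97], max_mutations=3)))
```

With the tutorial settings this gives one singlet, two doublets and one triplet, all containing position 57. Without the constraint there would be seven combinations.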

Step 6: Choose structure preservation mode

You can choose whether you want mutants with a preserved or a disrupted structure, depending on your purpose. Preserved mode favours similar amino acids, while Disrupted mode favours more distant ones.

Preserved

Step 7: Choose the similarity matrix

Depending on the divergence or pairwise identity desired, you can select among Blosum45, Blosum50, Blosum62, Blosum80, Blosum90 or Pam30, Pam90 or even Pam250. By default, Blosum62 is selected.

Blosum62

Step 8: Choose the database to be used

Select which database should be used to construct the MSA. Uniclust is recommended for more reliable predictions. Faster results may be produced with other choices.

Uniclust

Step 9: Submit the form

Press: Run


Output

Selected parameters

Argument Chosen
Sequence >sp|P27352|IF_HUMAN Cobalamin binding intrinsic factor OS=Homo sapiens OX=9606 GN=CBLIF PE=1 SV=2
MAWFALYLLSLLWATAGTSTQTQSSCSVPSAQEPLVNGIQVLMENSVTSSAYPNPSILIA MNLAGAYNLKAQKLLTYQLMSSDNNDLTIGQLGLTIMALTSSCRDPGDKVSILQRQMENW APSSPNAEASAFYGPSLAILALCQKNSEATLPIAVRFAKTLLANSSPFNVDTGAMATLAL TCMYNKIPVGSEEGYRSLFGQVLKDIVEKISMKIKDNGIIGDIYSTGLAMQALSVTPEPS KKEWNCKKTTDMILNEIKQGKFHNPMSIAQILPSLKGKTYLDVPQVTCSPDHEVQPTLPS NPGPGPTSASNITVIYTINNQLRGVELLFNETINVSVKSGSVLLVVLEEAQRKNPMFKFE TTMTSWGLVVSSINNIAENVNHKTYWQFLSGVTPLNEGVADYIPFNHEHITANFTQY
Matrix Blosum62
Database uniclust
Number of mutant 3
Positions 28,57,97
Preserved True
Input type protein

Mutant sequences and scores

The mutant names are listed together with their respective scores in a table like the one below.
Sample output:

Scores: Mutant Score
I57V,M97L -0.623305422209993
V28I,I57V -0.867803185985955
V28I,I57V,M97L -1.2911568830377
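A table in this form is easy to post-process. The sketch below parses the three sample rows above and sorts them so that the variants with the score closest to zero (the smallest predicted effect, following the interpretation given for the screening scores) come first. The function name `parse_scores` and the assumption that each row is "name, whitespace, score" are illustrative.

```python
sample = """\
I57V,M97L -0.623305422209993
V28I,I57V -0.867803185985955
V28I,I57V,M97L -1.2911568830377
"""


def parse_scores(text):
    """Parse 'MUTANT SCORE' rows into (list_of_mutations, score) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, score = line.rsplit(None, 1)  # split on the last whitespace run
        rows.append((name.split(","), float(score)))
    return rows


# Highest (least negative) score first.
ranked = sorted(parse_scores(sample), key=lambda row: row[1], reverse=True)
for mutations, score in ranked:
    print(len(mutations), ",".join(mutations), round(score, 3))
```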

Several functions are then available to display or download the newly generated sequences. Use the dropdown to select a mutant, then press the button Show sequence to display its sequence. You can also open the sequence in a new tab using the button Open in new tab. Every generated sequence can be downloaded separately in fasta format by selecting it from the dropdown and pressing Download fasta. To download all mutant sequences in a single fasta file, use the button Download every fasta.
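The relationship between a mutant name and its sequence can be reproduced offline. The sketch below applies names of the form V28I to a wild-type sequence and formats the result as FASTA. The helper names, the 60-character line width and the 30-residue excerpt are assumptions for illustration, and the files produced by ProMutor may be laid out differently.

```python
import re

MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def apply_mutations(sequence, names):
    """Apply point mutations such as 'V28I' (1-based) to a sequence."""
    residues = list(sequence)
    for name in names:
        match = MUTATION.match(name)
        if match is None:
            raise ValueError(f"unrecognised mutation name: {name}")
        wild_type, position, substitute = match.group(1), int(match.group(2)), match.group(3)
        if not 1 <= position <= len(residues) or residues[position - 1] != wild_type:
            raise ValueError(f"{name} does not match the wild-type sequence")
        residues[position - 1] = substitute
    return "".join(residues)


def to_fasta(header, sequence, width=60):
    """Format one record as FASTA with fixed-width sequence lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{header}\n" + "\n".join(lines) + "\n"


# First 30 residues of the tutorial protein, used as a short stand-in.
wild_type = "MAWFALYLLSLLWATAGTSTQTQSSCSVPS"
mutant = apply_mutations(wild_type, ["V28I"])
print(to_fasta("IF_HUMAN_excerpt V28I", mutant), end="")
```

Checking the wild-type residue before substituting catches off-by-one errors in position numbering.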

References

  1. J.-E. Shin, A. J. Riesselman, A. W. Kollasch, C. McMahon, E. Simon, C. Sander, A. Manglik, A. C. Kruse, and D. S. Marks
    Protein design and variant prediction using autoregressive generative models
    Nature communications, vol. 12, no. 1, pp. 1–11, 2021.
    DOI: 10.1101/757252


  2. A. J. Riesselman, J. B. Ingraham, and D. S. Marks
    Deep generative models of genetic variation capture mutation effects
    bioRxiv, 2017
    DOI: 10.1101/235655


  3. S. Henikoff and J. G. Henikoff
    Amino acid substitution matrices from protein blocks.
    Proc Natl Acad Sci USA, vol. 89, 1992.
    DOI: 10.1073/pnas.89.22.10915


  4. E. Laine, Y. Karami, and A. Carbone
    GEMME: A Simple and Fast Global Epistatic Model Predicting Mutational Effects
    Molecular Biology and Evolution, vol. 36, pp. 2604–2619, Aug. 2019.
    DOI: 10.1093/molbev/msz179


  5. M. H. Høie, M. Cagiada, A. H. Beck Frederiksen, A. Stein, and K. Lindorff-Larsen
    Predicting and interpreting large-scale mutagenesis data using analyses of protein stability and conservation.
    Cell Reports, vol. 38, no. 2, p. 110207, 2022.
    DOI: 10.1016/j.celrep.2021.110207


  6. M. Remmert, A. Biegert, A. Hauser, and J. Söding
    HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment
    Nature Methods, vol. 9, pp. 173–175, Feb. 2012.
    DOI: 10.1038/nmeth.1818


  7. M. Mirdita, L. von den Driesch, C. Galiez, M. J. Martin, J. Söding, and M. Steinegger
    Uniclust databases of clustered and deeply annotated protein sequences and alignments
    Nucleic Acids Res, vol. 45, pp. D170–D176, Nov. 2016.
    DOI: 10.1093/nar/gkw1081


  8. M. Steinegger, M. Meier, M. Mirdita, H. Vöhringer, S. J. Haunsberger, and J. Söding
    HH-suite3 for fast remote homology detection and deep protein annotation
    BMC Bioinformatics, vol. 20, p. 473, Sep 2019.
    DOI: 10.1186/s12859-019-3019-7


  9. E. L. Sonnhammer, S. R. Eddy, and R. Durbin
    Pfam: a comprehensive database of protein domain families based on seed alignments
    Proteins, vol. 28, pp. 405–420, July 1997.
    DOI: 10.1002/(sici)1097-0134


  10. A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia
    SCOP: a structural classification of proteins database for the investigation of sequences and structures
    J Mol Biol, vol. 247, pp. 536–540, Apr. 1995.
    DOI: 10.1006/jmbi.1995.0159


  11. J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool, R. Bates, A. Žídek, A. Potapenko, A. Bridgland, C. Meyer, S. A. A. Kohl, A. J. Ballard, A. Cowie, B. Romera-Paredes, S. Nikolov, R. Jain, J. Adler, T. Back, S. Petersen, D. Reiman, E. Clancy, M. Zielinski, M. Steinegger, M. Pacholska, T. Berghammer, S. Bodenstein, D. Silver, O. Vinyals, A. W. Senior, K. Kavukcuoglu, P. Kohli, and D. Hassabis
    Highly accurate protein structure prediction with AlphaFold
    Nature, vol. 596, pp. 583–589, Aug 2021.
    DOI: 10.1038/s41586-021-03819-2


  12. M. Mirdita, K. Schütze, Y. Moriwaki, L. Heo, S. Ovchinnikov, and M. Steinegger
    ColabFold: making protein folding accessible to all.
    Nature Methods, vol. 19, pp. 679–682, Jun 2022.
    DOI: 10.1038/s41592-022-01488-1