Modeling

"All models are wrong but some are useful" -George Box

Introduction

As shown in our project description, nanobodies are a highly specific, fast, and easy-to-produce alternative to conventional vaccination strategies in the struggle to contain and reduce the spread of viruses. However, the bottleneck of this approach lies in finding nanobodies that bind specifically to a given virus. Current strategies, such as phage display for nanobody library generation, originated as an alternative to the immunization of animal test subjects. Nevertheless, these techniques can still be costly and time-consuming. Generating synthetic libraries and determining their binding affinity computationally is therefore a natural next step towards obtaining new nanobody sequences faster and more cheaply. This is why we have designed a full workflow that generates nanobody sequences from an arbitrary input viral epitope, as follows:

Figure 1: Schematic representation of the sequence filtration across the workflow

Step 1 - Library Generation and filtering of sequences


Library generation

The main difficulty in finding nanobodies that bind our chosen epitope lies in the enormous number of possible nanobody sequences. For nanobodies whose three complementarity-determining regions (CDRs) have lengths of 9, 14 and 12 residues, a library of every possible combination would contain 20^35 (roughly 3 × 10^45) sequences, which is practically impossible to analyze. Therefore, we used a library from the literature that was designed by analyzing existing nanobodies and determining which amino acid variations at each position still produce stable structures [1]. Limiting the amino acids allowed at each position reduces the number of sequences to around 10^14, which is more manageable.
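The arithmetic behind these library sizes can be checked with a minimal sketch. The CDR lengths come from the text above; the per-position restriction (`toy_design`) is a hypothetical example chosen only to land near 10^14 and is not the published library design from [1].

```python
from math import log10, prod

# CDR lengths are taken from the text; everything else is illustrative.
CDR_LENGTHS = {"CDR1": 9, "CDR2": 14, "CDR3": 12}


def full_library_size(cdr_lengths, alphabet_size=20):
    """Sequences possible if every CDR position may hold any amino acid."""
    return alphabet_size ** sum(cdr_lengths.values())


def restricted_library_size(allowed_per_position):
    """Sequences possible when each position has its own allowed residues."""
    return prod(len(allowed) for allowed in allowed_per_position)


full = full_library_size(CDR_LENGTHS)
print(f"unrestricted library: about 10^{log10(full):.1f} sequences")

# Hypothetical restriction (NOT the published design): 21 positions are fixed
# to one residue and 14 positions may each hold one of 10 residues.
toy_design = ["A"] * 21 + ["ADEGKNRSTY"] * 14
restricted = restricted_library_size(toy_design)
print(f"toy restricted library: about 10^{log10(restricted):.1f} sequences")
```

Because the library size is a product over positions, fixing or narrowing even a few positions removes many orders of magnitude from the search space.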

Clustering

To further reduce the number of sequences, we based our modeling on literature that employs a clustering algorithm in which the similarity between a pair of nanobodies is quantified by the number of Cα atom pairs with a distance below 1 Å [2]. From this similarity measure, a distance matrix can be built and the nanobodies grouped into clusters of tens to hundreds of structures, each of which can be averaged into a single sequence per cluster for further analysis.
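As a rough illustration of the idea, the sketch below scores two structures by the fraction of equivalent Cα atoms lying within 1 Å of each other and then groups structures with a simple greedy (leader) clustering. This is a stand-in chosen for brevity, not the algorithm of [2]; it assumes the structures are already superimposed and have equal length, and the function names, the 0.9 threshold and the toy coordinates are all hypothetical.

```python
from math import dist


def ca_similarity(coords_a, coords_b, cutoff=1.0):
    """Fraction of equivalent C-alpha atoms closer than `cutoff` angstroms.

    Assumes both structures are already superimposed and of equal length.
    """
    close = sum(1 for a, b in zip(coords_a, coords_b) if dist(a, b) < cutoff)
    return close / len(coords_a)


def greedy_cluster(structures, min_similarity=0.9):
    """Put each structure in the first cluster whose representative it matches.

    Returns a list of clusters; each cluster is a list of indices and its
    first index is the cluster representative.
    """
    clusters = []
    for i, coords in enumerate(structures):
        for members in clusters:
            representative = structures[members[0]]
            if ca_similarity(representative, coords) >= min_similarity:
                members.append(i)
                break
        else:  # no existing cluster was similar enough
            clusters.append([i])
    return clusters


# Toy C-alpha traces (made-up coordinates, three atoms each).
trace_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
trace_b = [(x, y + 0.2, z) for x, y, z in trace_a]  # nearly identical to a
trace_c = [(x, y + 5.0, z) for x, y, z in trace_a]  # far from a and b
toy_clusters = greedy_cluster([trace_a, trace_b, trace_c])
print(toy_clusters)
```

Keeping only one representative per cluster is what shrinks the library before the more expensive structure prediction and docking steps.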

From this point onwards, we have taken two different but complementary approaches to predict the binding affinity of the filtered nanobodies for the input viral epitope.

Step 2 - Modeling of Sequences/Predicting protein structures


Predicting protein structures

After reducing the library to about 10^5 sequences, we want to generate 3D structures with NanoNet, a deep-learning-based folding algorithm optimized for nanobody structure generation [2]. This program has proved to be faster and more accurate than AlphaFold for our particular application.

Step 3 - Docking with a given antigen/Sequence based binding affinity prediction


Docking

Using the 3D protein structures obtained from NanoNet, a docking approach can be employed. RosettaDock can be used to dock the in silico developed nanobodies to the hemagglutinin antigens. With a relatively simple parallel protocol on a supercomputer, RosettaDock can run up to 1000 dockings in parallel, making high-throughput docking feasible [3]. The high-throughput docking protocol yields a ranking of nanobodies by their predicted binding affinity for the selected antigen.
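The parallel score-then-rank pattern described above can be sketched as follows. This block does not call RosettaDock: `mock_dock_score` is a hypothetical placeholder that returns a made-up number, and a real pipeline would replace it with a function that launches one docking run and parses its score. The sketch assumes that a lower score means better predicted binding, as is the convention for Rosetta energy scores.

```python
from concurrent.futures import ThreadPoolExecutor


def mock_dock_score(nanobody_id):
    """Placeholder for a single docking run; returns a made-up score."""
    return -float(sum(ord(c) for c in nanobody_id) % 50)


def rank_by_docking(nanobody_ids, score_fn=mock_dock_score, max_workers=4):
    """Score all nanobodies concurrently and sort them best-first.

    Assumes lower scores indicate better predicted binding.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(score_fn, nanobody_ids))
    return sorted(zip(nanobody_ids, scores), key=lambda pair: pair[1])


# Hypothetical identifiers for structures produced in the previous step.
ranking = rank_by_docking(["nb_001", "nb_002", "nb_003"])
for nanobody_id, score in ranking:
    print(nanobody_id, score)
```

Each docking run is independent of the others, which is why the workload spreads naturally across many workers or cluster nodes.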

Sequence-based binding affinity prediction

Despite continuous improvements in protein structure prediction software, modeling nanobodies and antibodies is still a challenge for widely used programs such as AlphaFold, which do not accurately represent the binding sites (CDR loops) of nanobodies. The high flexibility of these loops gives them a high B-factor, which in turn makes predictions imprecise. To tackle this difficulty, the literature suggests using machine learning algorithms to predict the binding between a given antigen and an antibody based only on the antibody amino acid sequence and its chemical properties (particularly pI and hydrophobicity) [3].

The ‘weighted k-nearest neighbor’ algorithm employed measures the distance between antibody sequences using the Levenshtein distance (which considers only amino acid identity) combined with BLOSUM62 (a substitution matrix that accounts for the differing properties of amino acids), and uses this distance to classify each sequence as a good or bad binder (Y/N). The accuracy of the method is further improved with a random forest algorithm, which accounts for chemical properties such as pI and hydrophobicity to refine the classification. This last step helps to further reduce the sequence library after clustering. The suggested approach reached 76.3% accuracy when compared with the re-docking of experimentally determined crystal structures of nanobodies and receptor proteins.
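A minimal sketch of the distance-weighted k-nearest-neighbor step is shown below. It is not the implementation from [3]: the substitution cost `toy_sub_cost` uses a coarse, hypothetical grouping of amino acids as a stand-in for BLOSUM62, the inverse-distance weighting is one common choice among several, the training sequences and labels are made up, and the random forest refinement is not shown.

```python
# Coarse residue groups used as a stand-in for BLOSUM62 (illustrative only).
GROUPS = ["AVLIMC", "FWY", "KRH", "DE", "STNQ", "GP"]


def toy_sub_cost(x, y):
    """0 for identical residues, 0.5 within a group, 1 otherwise."""
    if x == y:
        return 0.0
    same_group = any(x in g and y in g for g in GROUPS)
    return 0.5 if same_group else 1.0


def unit_sub_cost(x, y):
    """Plain Levenshtein substitution cost: identity only."""
    return 0.0 if x == y else 1.0


def edit_distance(a, b, sub_cost=unit_sub_cost, gap_cost=1.0):
    """Levenshtein-style distance with a pluggable substitution cost."""
    prev = [j * gap_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + gap_cost,              # deletion from a
                curr[j - 1] + gap_cost,          # insertion into a
                prev[j - 1] + sub_cost(ca, cb),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def weighted_knn(query, labelled, k=3, sub_cost=toy_sub_cost):
    """Classify `query` as binder (True) or non-binder (False).

    `labelled` is a list of (sequence, is_binder) pairs; each of the k
    nearest neighbours votes with weight 1 / distance.
    """
    by_distance = sorted(
        ((edit_distance(query, seq, sub_cost), label) for seq, label in labelled),
        key=lambda pair: pair[0],
    )
    votes = {True: 0.0, False: 0.0}
    for d, label in by_distance[:k]:
        votes[label] += 1.0 / (d + 1e-6)  # small constant avoids division by zero
    return votes[True] > votes[False]


# Made-up training data for illustration.
toy_training = [
    ("GRTFSSYA", True),
    ("GRTFSSYG", True),
    ("AAAAAAAA", False),
    ("CCCCCCCC", False),
]
print(weighted_knn("GRTFSSYA", toy_training))
```

Weighting votes by inverse distance lets a few very close neighbours outweigh more distant ones, which matters when the training set is small and unevenly sampled.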

With this code, our team wants to input a new viral strain sequence and find the best-binding nanobody from the library generated in Step 1. Our code first compares the input epitope with the epitopes in the training dataset and finds the most similar epitopes along with their corresponding nanobodies in that same dataset. The nanobodies found are then compared with the filtered nanobody library that was generated. Finally, the best matches are tested for binding to the input epitope with the machine learning algorithm.
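The lookup described in this paragraph can be sketched roughly as below. This is an assumption-laden illustration rather than our production code: similarity is measured with Python's standard `difflib.SequenceMatcher` instead of a biological alignment score, the function and parameter names are hypothetical, and the sequences are made up.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Generic string similarity in [0, 1]; a stand-in for an alignment score."""
    return SequenceMatcher(None, a, b).ratio()


def shortlist_candidates(query_epitope, training_pairs, library,
                         n_epitopes=2, n_candidates=3):
    """Pick library nanobodies that resemble known binders of similar epitopes.

    `training_pairs` is a list of (epitope, nanobody) sequence pairs and
    `library` is a list of candidate nanobody sequences.
    """
    # 1. Find the training epitopes most similar to the query epitope.
    ranked = sorted(training_pairs,
                    key=lambda pair: similarity(query_epitope, pair[0]),
                    reverse=True)
    reference_nanobodies = [nanobody for _, nanobody in ranked[:n_epitopes]]

    # 2. Score each library member by its best match to a reference nanobody.
    scored = [
        (max(similarity(candidate, ref) for ref in reference_nanobodies), candidate)
        for candidate in library
    ]
    scored.sort(reverse=True)

    # 3. Return the top candidates for the binding classifier.
    return [candidate for _, candidate in scored[:n_candidates]]


# Made-up sequences for illustration only.
toy_pairs = [("AAAA", "GGGG"), ("CCCC", "TTTT")]
toy_library = ["GGGA", "TTTT", "GGGG"]
shortlist = shortlist_candidates("AAAC", toy_pairs, toy_library,
                                 n_epitopes=1, n_candidates=2)
print(shortlist)
```

The shortlist keeps the number of sequences passed to the binding classifier small, so the classifier is only applied to library members that already resemble known binders of related epitopes.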

Conclusion

With this workflow, our team is able to propose a manageable number of best-binding nanobodies for a target viral epitope through docking and sequence-based affinity predictions. Critical to this workflow were the creation of a well-curated and filtered library, which makes the docking and binding predictions computationally feasible, and the integration of multiple programs optimized for each task.

References

[1] McMahon, C. et al. (2018) Yeast surface display platform for rapid discovery of conformationally selective nanobodies. Nature

[2] Cohen, T. et al. (2021) NanoNet: Rapid end-to-end nanobody modeling by deep learning at sub angstrom resolution. bioRxiv

[3] Ye, C. et al. (2021) Machine learning prediction of Antibody-Antigen binding: dataset, method and testing. bioRxiv