The last couple of years have seen a huge expansion in the field of structural biology, along with a rapid increase in the availability of models that predict the 3D structure of proteins. It has therefore become increasingly important to build models that can locate the functional sites on the surface of proteins, the sites responsible for the interactions with other biomolecules that drive the biochemical reactions fundamental to all biological functions. This year, we tackled this problem by creating a new computational model that predicts the binding sites on the surface of a protein sequence of interest, training a Convolutional Neural Network (CNN) classifier to make accurate predictions of these interaction sites. We also included a model for measuring the affinity between the introduced protein and its respective aptamer or antibody: a regression model that ranks protein-antibody or protein-aptamer interactions based on ΔG calculations, which reflect the stability of the resulting complex. Furthermore, we developed a novel aptamer prediction tool, since experimental aptamer selection has many limitations in the lab and few software tools are available to address it. Lastly, we will discuss the novel Lateral Flow Assay (LFA) semi-quantification tool that we developed to aid in interpreting the results of our LFA test for phenylketonuria. Herein, we discuss the development process and results of this year’s software and our pipeline for building each of these models.
Figure 1. Demonstrating our pipeline for prediction of protein binding sites, affinity modelling, and structural validation of the results.
- Prediction of binding site residues on the surface of different proteins.
- Calculating the affinity and physicochemical properties of Protein-Antibody and Protein-Aptamer complexes.
- Classification model for predicting Anti-CRISPR proteins.
- Semi-quantification software for phenylketonuria Lateral Flow Assay (LFA) test.
- Whole-Cell Biosensor (WCB) semi-quantification tool.
- Novel Colorimetric analyser software.
Our software pipeline and all the notebooks are now available on our team’s Gitlab software website. You can access our team’s software and notebook through this link: https://gitlab.igem.org/2022/software-tools/afcm-egypt
The model has shown great results in training, reaching 96% accuracy for binding-site prediction. When tested on a different dataset, it also achieved an outstanding 95% accuracy, as shown in the figures below.
Figure 2. Shows the architecture of the binding site prediction model, along with confusion matrices and ROC curves showing the performance of our classification model. Training of the model shows an AUC of 0.96, and testing on another dataset shows an AUC of 0.95.
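For readers who want a concrete picture of the classifier, the sketch below shows a minimal 1D-CNN of the kind described above; the window length, layer sizes, and one-hot encoding are illustrative assumptions rather than the exact architecture behind Figure 2.

```python
# Illustrative sketch only: a 1D-CNN that labels each residue window as
# binding-site (1) or non-binding (0). Layer sizes and window length are
# assumptions, not the exact architecture of the published model (BB3.h5).
from tensorflow import keras
from tensorflow.keras import layers

WINDOW = 15          # residues per input window (assumed)
N_AMINO_ACIDS = 20   # one-hot channels

model = keras.Sequential([
    layers.Input(shape=(WINDOW, N_AMINO_ACIDS)),
    layers.Conv1D(64, kernel_size=3, activation="relu"),
    layers.Conv1D(32, kernel_size=3, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # probability of "binding site"
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy", keras.metrics.AUC()])

# X: (n_windows, WINDOW, 20) one-hot encoded sequence windows, y: 0/1 labels
# model.fit(X_train, y_train, validation_split=0.1, epochs=30, batch_size=64)
```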
After running our software to predict the probable aptamers for a protein sequence, the predicted candidates are ranked according to their binding affinity to their target protein and various physical and chemical parameters, which are essential for the characterization of the predicted aptamer sequences.
Adaptation of our affinity regression model to aptamer-protein and antibody-protein complexes:
For this approach, we decided to use a regression model to calculate quantitative values of aptamer-protein and antibody-protein affinity using ΔG calculations, which measure the stability of the formed complexes. A linear activation function was added so that the network outputs continuous, unbounded values, which defines the type of predictions the model can make: regression of ΔG rather than classification.
Training of this model began with a literature search for suitable datasets containing sequences of proteins and their respective antibodies and aptamers, along with ΔG calculations for the formed complexes. We then built our neural network; in this model, we added 6 dense layers to improve the accuracy of training. To train the model for antibody-protein affinity prediction, we used data from the Structural Antibody Database (SAbDab), forming a dataset of 493 antibody-protein complexes. We then processed this dataset with our binding site prediction model to gain even more specificity, measuring ΔG values not between whole sequences but between the residues responsible for binding between the two biomolecules. Our model has shown great results in ΔG affinity prediction, reaching a testing accuracy of 97.75%.
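To make this setup concrete, the sketch below builds a small fully connected regression network with six dense layers and a linear output for the ΔG value; the input encoding and layer widths are illustrative assumptions, not the exact published architecture.

```python
# Illustrative sketch: a dense regression network ending in a linear unit
# that outputs a single ΔG value. Input width and layer sizes are assumed.
from tensorflow import keras
from tensorflow.keras import layers

N_FEATURES = 400  # assumed length of the encoded complex (binding site + partner)

model = keras.Sequential([
    layers.Input(shape=(N_FEATURES,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="linear"),  # unbounded output for ΔG regression
])
model.compile(optimizer="adam", loss="mse",
              metrics=[keras.metrics.RootMeanSquaredError(),
                       keras.metrics.MeanAbsoluteError()])

# X: encoded antibody-protein (or aptamer-protein) pairs, y: experimental ΔG values
# history = model.fit(X_train, y_train, validation_split=0.1, epochs=100)
```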
Figure 3. Showing the results of our affinity regression model when applied to a dataset of antibody-protein complexes. (A) Shows a decrease in the mean absolute error, which can be used as a reference for the accuracy of the training results. (B) Shows a decrease in the root-mean-square error (RMSE), indicating that the predicted values lie closer to the regression line. (C) Shows an increase in the accuracy of the training and validation scores over epochs. (D) Shows an increase in the R-squared scores of training and validation over epochs, indicating that our model strongly explains the observed data.
Figure 4. Illustrating that our model was able to predict the ΔG values of the formed complexes with high accuracy and nearly no false predictions.
Figure 5. Showing the results of our affinity regression model when applied to a dataset of aptamer-protein complexes, including both DNA and RNA aptamers in the selected data. (A) Shows a decrease in the mean absolute error, which can be used as a reference for the accuracy of the training results. (B) Shows a decrease in the root-mean-square error (RMSE), indicating that the predicted values lie closer to the regression line. (C) Shows an increase in the accuracy of the training and validation scores over epochs. (D) Shows an increase in the R-squared scores of training and validation over epochs, indicating that our model strongly explains the observed data.
Figure 6. Showing our model's ability to predict the ΔG values of aptamer-protein complexes. (A) Shows the prediction results of the training process of the model. (B) Shows our model's performance when applied to another testing set and its ability to accurately predict the ΔG values of the formed complexes.
Prediction of physicochemical parameters for DNA and RNA aptamer sequences:
The evolutionary process rests on the fact that random mutations lead to a wide variety of phenotypes. These phenotypes have diverse properties that determine the final conformational and functional status of the selected phenotype. Physicochemical parameters are among the most important of these properties, since many of them strongly affect the functional and structural conformation of those phenotypes.
This year, we decided to apply the concept of directed evolution to DNA and RNA sequences, extending last year's work, in which we applied the same concept to the proteins used in our previous project. We therefore developed a new model for predicting the physicochemical properties of an RNA or DNA aptamer sequence, complementing last year's model for predicting the properties of proteins.
This model was created to predict the changes in the properties of an aptamer sequence after random mutations are introduced into it. We created a function that uses the physicochemical properties of dinucleotides to predict the overall properties of the introduced aptamer (see the sketch below). Because the physical and chemical parameters that can be calculated for any biomolecule are quite diverse, we chose the parameters that contribute most significantly to the structure and function of the resulting mutants: molecular weight, physical instability, hydrophobicity, charge of the molecule at physiological pH, entropy, stability and thermal capacity, and finally flexibility.
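As a rough illustration of this function, the sketch below averages a dinucleotide property index over all overlapping dinucleotides of a sequence; the numbers in the lookup table are placeholders, not the actual property scales used in our notebook.

```python
# Illustrative sketch: estimating a whole-aptamer property by averaging
# per-dinucleotide values. The lookup table below uses placeholder numbers,
# not the actual physicochemical scales used in the notebook.
EXAMPLE_DINUC_INDEX = {
    "AA": 0.12, "AC": 0.45, "AG": 0.33, "AU": 0.21,
    "CA": 0.50, "CC": 0.61, "CG": 0.75, "CU": 0.40,
    "GA": 0.38, "GC": 0.80, "GG": 0.66, "GU": 0.29,
    "UA": 0.18, "UC": 0.42, "UG": 0.55, "UU": 0.10,
}

def aptamer_property(seq: str, index: dict) -> float:
    """Average a dinucleotide property index over all overlapping dinucleotides."""
    seq = seq.upper().replace("T", "U")          # handle DNA input uniformly
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    values = [index[p] for p in pairs if p in index]
    return sum(values) / len(values) if values else 0.0

print(aptamer_property("AUGCUAGC", EXAMPLE_DINUC_INDEX))
```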
In order to ensure the safety of our Cas12g-based therapeutic circuit and avoid the off-targeting potential of the CRISPR system, we implemented a safety mechanism in the form of Anti-CRISPR proteins, which regulate and control the activity of the CRISPR gene delivery system. That is why we decided to develop a classification model for identifying Anti-CRISPR protein sequences using deep learning. This approach is also why we developed the protein-antibody affinity model, to help us estimate the affinity between the Anti-CRISPRs and the CRISPR system. Firstly, we started the training process using datasets of Anti-CRISPR protein sequences found during our literature search. The model uses a Long Short-Term Memory (LSTM) recurrent neural network, whose output is passed to a linear activation function and finally a log-softmax function. We used log-softmax instead of plain softmax because it provides better numerical stability and optimization, and it works better when the provided values span a very large range. Training of the model showed an accuracy of 99.19% in differentiating whether a provided sequence is an Anti-CRISPR or not.
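To give a concrete sense of this architecture, the sketch below is a minimal PyTorch LSTM classifier whose output passes through a linear layer and a log-softmax; the embedding size, hidden size, and integer tokenisation of residues are assumptions for illustration, not the trained model.

```python
# Illustrative sketch: an LSTM classifier over amino-acid tokens ending in
# log-softmax, as described above. Sizes and tokenisation are assumptions.
import torch
import torch.nn as nn

class AntiCRISPRClassifier(nn.Module):
    def __init__(self, vocab_size=21, embed_dim=32, hidden_dim=64, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.linear = nn.Linear(hidden_dim, n_classes)
        self.log_softmax = nn.LogSoftmax(dim=-1)

    def forward(self, tokens):                  # tokens: (batch, seq_len) integer-encoded residues
        embedded = self.embed(tokens)
        _, (hidden, _) = self.lstm(embedded)    # use the final hidden state
        return self.log_softmax(self.linear(hidden[-1]))

model = AntiCRISPRClassifier()
loss_fn = nn.NLLLoss()                          # pairs naturally with log-softmax outputs
dummy_batch = torch.randint(1, 21, (4, 120))    # 4 sequences of 120 residues
log_probs = model(dummy_batch)                  # (4, 2) log-probabilities per class
```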
After this, we tested our model using another Anti-CRISPRs dataset so as to validate the results of our training. The model also showed great testing results with an Area Under the Curve (AUC) of 0.9987, as shown in the ROC curves below.
Figure 7. Shows the training pipeline of our classification model for predicting Anti-CRISPR proteins and the structure of our deep learning neural network. (A) Shows the training accuracy of the model, reaching 99.87%. (B) Shows the decrease in the training loss over the course of training, aided by the linear activation and log-softmax functions.
Figure 8. Shows the testing results of our Anti-CRISPR model. (A) Shows a ROC curve presenting the testing accuracy of our model in predicting whether a provided sequence is truly an Anti-CRISPR protein, with an accuracy of 99.19%. (B) Shows a confusion matrix presenting another representation of our results, with nearly no false positive or false negative predictions.
In this notebook, we integrated the aforementioned models into one user-friendly notebook in which the user simply introduces the sequence of the protein of interest and the notebook automatically does the rest. Firstly, the user must load the necessary dependencies in order for the functions to work correctly. After this, the models are loaded: the binding site prediction model is loaded first, in the form of the BB3.h5 file, followed by the aptamer affinity model (NA_AFF.h5) and finally the antibody affinity model (protaff.h5). The user must therefore make sure that these files are uploaded to the platform in use. The packages required for running the cells are then imported, and the user is asked to insert their protein sequence into the target variable. The software then generates a random library of complementary molecules through a generator function; these can be DNA, RNA, or protein sequences, depending on the user's preference, which determines the type of the formed library. The user is also asked to provide a length range for the generated sequences, bearing in mind that the longer the sequences, the longer the software will take to run. In our example, we set the software to generate RNA sequences with lengths between 5 and 6 nucleotides.
Figure 9. Shows the part of the code in which the user is asked to submit the sequence of their protein. Note that the small red box at the bottom of the screen marks the section the user has to adjust, as this declares the nature of the formed random library, whether it is rna, dna, or protein; the user must adjust this section accordingly. In addition, the two numbers indicate the range of nucleotides or amino acids between which the user wants the library's lengths to lie. In this example, the lengths of the formed RNA sequences will lie between 5 and 6 nucleotides.
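As a minimal sketch of this setup step, the snippet below loads the three model files named above and builds a small random library; `random_library` is a hypothetical stand-in for the notebook's generator function, and the example target sequence is arbitrary.

```python
# Illustrative sketch of this step: load the three model files named in the
# notebook and build a small random library. `random_library` is a hypothetical
# stand-in for the notebook's generator function, shown here for clarity.
import random
from tensorflow import keras

binding_site_model = keras.models.load_model("BB3.h5")       # binding-site prediction
aptamer_affinity_model = keras.models.load_model("NA_AFF.h5")
antibody_affinity_model = keras.models.load_model("protaff.h5")

ALPHABETS = {"dna": "ACGT", "rna": "ACGU", "protein": "ACDEFGHIKLMNPQRSTVWY"}

def random_library(kind: str, min_len: int, max_len: int, n_seqs: int = 100):
    """Generate n_seqs random sequences of the chosen type within the length range."""
    alphabet = ALPHABETS[kind]
    return ["".join(random.choice(alphabet) for _ in range(random.randint(min_len, max_len)))
            for _ in range(n_seqs)]

target = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # example protein of interest (arbitrary)
library = random_library("rna", 5, 6)            # matches the 5-6 nucleotide example above
```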
Then, the binding site prediction process for the "target" sequence is initiated. Following this, functions encoding the physicochemical parameters of DNA, RNA, and proteins are defined. The user should only define the physicochemical parameters function corresponding to their choice in the previous cell, depending on whether they chose the formation of a random library of DNA, RNA, or proteins. For example, if the user chose to generate RNA aptamers, they should define the RNANucIndex function, and so on. Because physicochemical parameter measurement depends on many different properties and factors, we decided to apply Principal Component Analysis (PCA) to these parameters and compare the formed mutants based on a single value that summarizes them, as shown in the sketch below. PCA helps with dimensionality reduction in machine learning and simplifies the complexity of high-dimensional data, such as the physicochemical properties of a library of mutants.
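A minimal sketch of this PCA summarization, assuming a small placeholder matrix of per-mutant parameters (the actual notebook computes these values from the sequences):

```python
# Illustrative sketch: collapsing several physicochemical parameters per mutant
# into a single PCA score. Feature values below are placeholders.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# rows = mutants, columns = e.g. molecular weight, charge, hydrophobicity, entropy
params = np.array([
    [2450.1, -3.0, 0.42, 1.8],
    [2510.7, -2.0, 0.39, 1.6],
    [2398.4, -4.0, 0.47, 2.1],
])

scaled = StandardScaler().fit_transform(params)        # put parameters on a common scale
pca_score = PCA(n_components=1).fit_transform(scaled)  # one summary value per mutant
print(pca_score.ravel())
```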
Following this, binding affinity calculations are carried out. These predictions measure the ΔG of the complex formed between the previously predicted binding site and each member of the library of random mutants produced by the generator function. The data is then presented in a table containing the following columns: the mutants, the binding site (BS), the complex of the BS and the mutant, the physicochemical properties summarized with PCA, and finally the affinity scores as ΔG.
Figure 10. Shows the section that needs to be modified according to the library chosen by the user at the beginning of the model. This needs to be changed to rna, dna, or protein, according to the user's preference.
Following this, our directed evolution algorithm runs one round of evolution on the top-ranking aptamer or antibody. Random mutations are introduced separately at each position of the chosen sequence, and the change in physicochemical properties and ΔG values is measured (see the sketch after Figure 11). The user must again be careful to change the data-type setting accordingly, whether it is rna, dna, or protein.
Figure 11. Shows the section that needs to be modified according to the library chosen by the user at the beginning of the model. This needs to be changed to rna, dna, or protein, according to the user's preference.
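The sketch below illustrates one such round under simplifying assumptions: every position of the top-ranking RNA sequence is mutated in turn and the variant with the most favourable predicted ΔG is kept; `score_dg` is a hypothetical placeholder for the trained affinity model's prediction call.

```python
# Illustrative sketch of one round of directed evolution: mutate every position
# of the top-ranking sequence and keep the variant with the best (lowest) ΔG.
# `score_dg` is a hypothetical stand-in for the affinity model's prediction call.
import random

RNA_BASES = "ACGU"

def one_round(seq: str, score_dg) -> str:
    """Return the single-point mutant with the most favourable predicted ΔG."""
    candidates = [seq]
    for pos in range(len(seq)):
        for base in RNA_BASES:
            if base != seq[pos]:
                candidates.append(seq[:pos] + base + seq[pos + 1:])
    return min(candidates, key=score_dg)     # lower ΔG = more stable complex

# Example with a dummy scorer in place of the trained regression model:
best = one_round("AUGCU", score_dg=lambda s: random.random())
```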
Then, the fitness of the mutants formed through the round of directed evolution is presented. The graphs show the change in fitness score across the different generations of mutants.
Figure 12. Shows an increase in the fitness scores of the formed generations of mutants after subjecting them to a round of directed evolution.
Finally, we applied this model to our dataset of Anti-CRISPR proteins to improve their fitness and increase the level of safety and security in our CRISPR-regulated gene delivery system. This not only helps us choose the top-ranking Anti-CRISPR protein in terms of binding affinity to its target and physical and chemical stability, but also allows us to subject it to a round of directed evolution to further improve these parameters.
Figure 13. Shows the evolutionary landscape in the form of a specific set of parameters. Note that the evolutionary algorithm is not used directly to follow the evolutionary process, but rather to fine-tune the regression model that aids in measuring the ΔG of the mutant variants.
We also developed a user-friendly tool for analysing the results of the lateral flow assay test that detects phenylketonuria (PKU). The input to the software is an image of the test results, which is uploaded to the tool. The software first examines the quality of the provided image; if it is of poor quality, the tool returns an error message asking the user to provide another image with higher quality and resolution. After checking the image quality, the tool analyses the test results themselves by quantifying the pixels in both the control and the test lines. The decision is made against a pre-specified threshold chosen from a variety of phenylketonuria LFA test results with variable colour intensities. The tool first analyses the control-line signal to check the validity of the test and confirm good sample flow through the pads; if the control-line value is below 0.5, the test is considered invalid and should be repeated by the user. The test-line pixels are then examined to determine whether the test is positive or negative. The tool produces 3 outputs: first, a text stating the exact sample band ratio and whether the test is positive or negative for PKU, based on comparison with the pre-specified threshold; second, a graph relating the number of pixels in the test line to the number of pixels in the control line; and third, an intensified image of the test line for further confirmation of its presence.
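The sketch below illustrates this decision logic under simplifying assumptions: the image file name and the crop coordinates for the control and test lines are placeholders, and the band signal is approximated as the mean inverted greyscale intensity rather than the tool's exact pixel-counting procedure.

```python
# Illustrative sketch: compare test-line vs control-line signal against a
# threshold. Crop coordinates and file name are placeholders; the 0.5 cut-offs
# follow the description above.
import numpy as np
from PIL import Image

THRESHOLD = 0.5

img = np.asarray(Image.open("lfa_result.jpg").convert("L"), dtype=float)  # greyscale
control_band = img[40:60, 20:120]   # assumed crop of the control line
test_band = img[90:110, 20:120]     # assumed crop of the test line

# darker pixels = stronger band, so invert and normalise to 0-1
control_signal = (255 - control_band).mean() / 255
test_signal = (255 - test_band).mean() / 255

if control_signal < THRESHOLD:
    print("Invalid test: control line too weak, please repeat the test.")
else:
    ratio = test_signal / control_signal
    print(f"Sample band ratio: {ratio:.2f} ->",
          "Positive PKU test" if ratio >= THRESHOLD else "Negative PKU test")
```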
This software provides semi-quantification of the PKU lateral flow assay test and can be adapted to other LFA tests by applying the proper modifications according to the disease or condition the user is tackling. In addition, our tool is very helpful for people with colour blindness, since it interprets the LFA results as numbers that can be easily read by the user.
Figure 14. Demonstrating the user-friendly interface of our phenylketonuria lateral flow assay semi-quantification software. The input is a clear image of the LFA test, and the output takes 3 forms: a text stating whether the test is positive or negative, a line graph, and a focused image of the test line.
Figure 15. (left) Shows the software's results for a positive LFA test. Figure 16. (right) Shows the software's results for a negative LFA test.
We went even further and created an interface for semi-quantification of the results of a detection method using whole-cell biosensors (WCBs). This model is very similar to our LFA semi-quantification model; however, it can be generalized to other WCB detection tests by modifying the detection threshold based on the signal that indicates whether the test is positive or negative. The software uses pixel intensity quantification to count the pixels in the input image and compare them to a threshold we chose based on the intensities resulting from detection of different concentrations of phenylalanine. The difference between this model and the LFA model is that the latter compares the number of pixels in the test line with the number in the control line to read out the test, while the WCB model measures the pixels resulting from the reaction between the WCB and the tested sample.
The software checks the intensity of the detected pixels and interprets it as a probable concentration of phenylalanine in the sample, measured in mg/dl. In addition, a statement advises the patient on the recommended next step in the management process. If the concentration is < 2 mg/dl, the test reports a NEGATIVE PKU result. If the concentration is between 2 and 6 mg/dl, the result is a positive PKU test and the patient likely has a PKU phenotype. If the concentration is between 6 and 12 mg/dl, the test is positive, with dietary restriction recommended. If the concentration is between 12 and 20 mg/dl, the test recommends medical attention because of the risk of intellectual disability. Finally, if the result is > 20 mg/dl, the test reports a classic PKU case in which urgent medical attention is needed (a sketch of this mapping follows the figure below).
Figure 17. Shows the different messages that the user will encounter after submitting test results corresponding to different concentrations of the biomolecule of interest. We tested our software using our whole-cell biosensor, which detected different concentrations of phenylalanine. (A) If the concentration is between 2 and 6 mg/dl, the result is a positive PKU test and the patient likely has a PKU phenotype. (B) If the concentration is between 6 and 12 mg/dl, the test is positive, with dietary restriction recommended. (C) If the result is between 12 and 20 mg/dl, the test recommends medical attention because of the risk of intellectual disability. (D) If the result is > 20 mg/dl, the test reports a classic PKU case in which urgent medical attention is needed. (E) If the concentration is < 2 mg/dl, the test reports a NEGATIVE PKU result.
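As a rough illustration of how these cut-offs could be applied, the sketch below maps an estimated phenylalanine concentration to the advisory messages described above; the function name and exact wording are illustrative, not the tool's actual code.

```python
# Illustrative sketch mapping an estimated phenylalanine concentration (mg/dl)
# to the advisory messages described above.
def pku_message(phe_mg_dl: float) -> str:
    if phe_mg_dl < 2:
        return "NEGATIVE PKU test."
    if phe_mg_dl < 6:
        return "Positive PKU test: the patient likely has a PKU phenotype."
    if phe_mg_dl < 12:
        return "Positive PKU test: dietary restriction is recommended."
    if phe_mg_dl <= 20:
        return "Positive PKU test: seek medical attention (risk of intellectual disability)."
    return "Classic PKU: urgent medical attention is needed."

for value in (1.5, 4, 8, 15, 25):
    print(value, "mg/dl ->", pku_message(value))
```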
Finally, we decided to create a software tool that can analyse the colour signal emitted from any colour-producing glassware, such as a test tube or a culture-media agar plate. This tool requires the submission of 2 input images instead of one, unlike the aforementioned software tools. Its pixel intensity quantification function measures the number of pixels containing each colour. Input image number 1 is the control image, used as a reference against which the colour of the other image is analysed; this control image should show the optimum results of the test so as to give the user a solid ground for comparison. The output is a comprehensive analysis of the colour detected in both images. The first output measures the percentage of the most visible colour in the control image, while the second output measures the percentage of the most visible colour in the test image. As the colour we are searching for is that of the control image, the third output measures the ratio between the intensity of that colour in the test image and the intensity of the same colour in the control image. In the provided example, the colour shown in the control image is a shade of blue, so the colour being analysed is blue; consequently, the number shown in the output 2 window is 0.0, as the test image contains no blue. The next outputs measure the intensity of the most visible colour in the control and test images, respectively; in our example, the intensity of the blue colour in the control image was 224, while the intensity of the green colour in the test image was 154. The following outputs are 2 histograms of the control and test images, giving a graphical representation of the pixels found in each. After this, the tool compares different wavelengths of the present colour and the absorbance of those wavelengths in both the control and test images. The final 2 outputs are 2 pie charts showing the percentage of the most visible colours in the control and test images.
The most important output of the tool is the one labelled output 2, which measures the ratio between the colour of interest and that colour's presence in the test image. Another very important output is the pair of graphs showing the relation between colour absorbance and wavelength, as the colour in either image can be determined from the range of wavelengths in which the identified colour's wavelength lies. However, there may be a slight deviation in these wavelengths, because the provided image may contain a range of different colours, which can alter the resulting graph.
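As a simplified illustration of this "output 2"-style comparison, the sketch below finds the dominant RGB channel of a control image and measures how strongly that same channel appears in the test image; the file names and the dominant-channel approximation are assumptions and do not reproduce the tool's full colour analysis.

```python
# Illustrative sketch of the core comparison: how much of the control image's
# dominant colour channel appears in the test image. This is a simplified
# assumption of the tool's pixel intensity quantification, not its exact code.
import numpy as np
from PIL import Image

def dominant_channel_intensity(path: str):
    """Return (channel name, mean intensity of that channel) for an RGB image."""
    rgb = np.asarray(Image.open(path).convert("RGB"), dtype=float)
    means = rgb.reshape(-1, 3).mean(axis=0)
    idx = int(means.argmax())
    return "RGB"[idx], means[idx]

control_channel, control_intensity = dominant_channel_intensity("control.jpg")   # placeholder file
test_rgb = np.asarray(Image.open("test.jpg").convert("RGB"), dtype=float)        # placeholder file
test_intensity = test_rgb.reshape(-1, 3).mean(axis=0)["RGB".index(control_channel)]

ratio = test_intensity / control_intensity   # ratio of the control's colour in the test image
print(f"Control colour: {control_channel}, intensity {control_intensity:.0f}; "
      f"ratio in test image: {ratio:.2f}")
```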
Figure 18. Showing the interface of the colour analyser software. Our example shows an input of a blue-coloured control culture media, while the test culture media shows a green colour.
Figure 19. Showing outputs 7, 8, 9, and 10 of the software tool. (A) and (B) show 2 histograms analysing the pixel intensities of the provided images. (C) and (D) show the relation between the wavelengths of the detected colour in the control and test images, respectively, and the absorbance of these colours.
Figure 20. Shows 2 pie charts demonstrating the percentage of the top 10 colours found in the control image (A) and the test image (B).
Our software pipeline and all the notebooks are now available on our team’s GitLab software website. You can access our team’s software and notebooks through this link: https://gitlab.igem.org/2022/software-tools/afcm-egypt