Introduction
In our project, FXR is fused to ddRFP-A1 through a linker. Once activated by CDCA, FXR undergoes a conformational change and binds RXR, driving the downstream ddRFP-A1 and ddRFP-B1 to dimerize and emit fluorescence. However, we know little about the structures of the FXR-ddRFPA1 and RXR-ddRFPB1 fusion proteins, or about the proteins' conformations at each stage of the experiment. We therefore set out to study and analyze our key proteins FXR and RXR (PDB codes: 5qok, 1fby) to better understand their properties. To this end, we used a Support Vector Machine (SVM) to predict the secondary structure of CDCA+FXR and CDCA+FXR+RXR, and we also predicted the tertiary structures of FXR+ddRFP, RXR+ddRFP and CDCA+FXR.
Prediction of the Secondary Structure
The connectivity between amino acids is involved in the docking of CDCA-FXR with RXR, as well as in the binding of CDCA to FXR. To understand the bound proteins, we employed an SVM to predict the secondary structure at the connection points.
SVM
A Support Vector Machine (SVM) is a generalized linear classifier that performs binary classification by supervised learning. Its decision boundary is the maximum-margin hyperplane solved for from the training samples. As an intelligent algorithm, SVM achieves high accuracy on binary classification problems.
How to use SVM to predict protein secondary structure?
1. DSSP classifies residues by the repeating patterns of hydrogen bonds in a protein's three-dimensional structural coordinates, dividing secondary structure into eight states: H (\(\alpha\)-helix), G (3-helix or \(3 _ {10}\) helix), I (5-helix or \(\pi\)-helix), E (extended strand participating in a \(\beta\)-ladder), B (residue in an isolated \(\beta\)-bridge), T (hydrogen-bonded turn), S (bend) and - (other). The 8-state scheme is a refinement of the 3-state one: H (\(\alpha\)-helix), E (\(\beta\)-strand) and C (coil). In this part, we constructed a three-class support vector machine model to predict the secondary structure of a given amino acid sequence.
2. We used a one-vs-rest (OVR) scheme to achieve three-class classification. The specific classifiers were as follows:
If H was taken as the positive class, then E and C were taken as negative samples.
If E was taken as the positive class, then H and C were taken as negative samples.
If C was taken as the positive class, then H and E were taken as negative samples.
The data D to be tested are input into the three classifiers respectively, and the category is determined by whichever classifier gives the highest predictive value.
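The OVR decision rule above can be sketched as follows; the three scoring functions here are hypothetical stand-ins for trained binary SVM decision functions, not our actual model:

```python
# Minimal sketch of the one-vs-rest (OVR) decision rule for the three
# secondary-structure classes H, E and C: feed the sample to all three
# binary classifiers and return the class with the highest score.

def predict_ovr(x, scorers):
    """Return the class whose binary classifier gives the highest score."""
    return max(scorers, key=lambda label: scorers[label](x))

# Toy scorers: each returns a "decision value" for its positive class.
scorers = {
    "H": lambda x: x[0],   # stand-in for the H-vs-rest SVM
    "E": lambda x: x[1],   # stand-in for the E-vs-rest SVM
    "C": lambda x: x[2],   # stand-in for the C-vs-rest SVM
}

print(predict_ovr((0.2, 0.9, -0.3), scorers))  # -> E
```

In the real model each scorer would be the decision value of one trained binary SVM.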
The Establishment of model
There are 20 standard amino acids, and we encoded each amino acid with a 20-bit one-hot code, as shown in the following table:
Kind | Letter representation | Encode |
---- | --------------------- | ------ |
Ala | A | 10000000000000000000 |
Cys | C | 01000000000000000000 |
Asp | D | 00100000000000000000 |
Glu | E | 00010000000000000000 |
Phe | F | 00001000000000000000 |
Gly | G | 00000100000000000000 |
His | H | 00000010000000000000 |
Ile | I | 00000001000000000000 |
Lys | K | 00000000100000000000 |
Leu | L | 00000000010000000000 |
Met | M | 00000000001000000000 |
Asn | N | 00000000000100000000 |
Pro | P | 00000000000010000000 |
Gln | Q | 00000000000001000000 |
Arg | R | 00000000000000100000 |
Ser | S | 00000000000000010000 |
Thr | T | 00000000000000001000 |
Val | V | 00000000000000000100 |
Trp | W | 00000000000000000010 |
Tyr | Y | 00000000000000000001 |
The secondary structure of an amino acid at a given position is affected by the properties of the amino acids connected to it. Therefore, we constructed the data set with a sliding window. To improve prediction accuracy as much as possible, we took a sequence of five adjacent amino acids as one data point. Taking the sequence AVLIF as an example, to predict the central residue L, 2 amino acid residues are selected from each side, giving 5 residues in total, with the following encoding:
10000000000000000000
00000000000000000100
00000000010000000000
00000001000000000000
00001000000000000000
To predict the first amino acid A while keeping the window size at 5, two all-zero placeholder residues are added before A, which is equivalent to padding with 40 zero bits. The encoding result is:
00000000000000000000
00000000000000000000
10000000000000000000
00000000000000000100
00000000010000000000
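The one-hot encoding and zero-padded sliding window described above can be sketched as follows (a minimal illustration; the function and variable names are our own, not from the actual pipeline):

```python
# Sketch of the 20-bit one-hot encoding and the 5-residue sliding window,
# with all-zero padding at the sequence ends, as described above.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, in table order

def one_hot(residue):
    """20-bit one-hot code for a single residue; all zeros for the pad symbol."""
    code = [0] * 20
    if residue != "-":
        code[AMINO_ACIDS.index(residue)] = 1
    return code

def window_features(sequence, center, half_width=2):
    """Concatenate the one-hot codes of the residues in a window around `center`."""
    padded = "-" * half_width + sequence + "-" * half_width
    window = padded[center : center + 2 * half_width + 1]
    features = []
    for residue in window:
        features.extend(one_hot(residue))
    return features

# Predicting the first residue of AVLIF pads two all-zero "residues" in front,
# so the first 40 of the 100 feature bits are zero.
x = window_features("AVLIF", 0)
print(len(x), sum(x[:40]))  # -> 100 0
```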
The basic idea of SVM is to map the training data nonlinearly into a high-dimensional feature space (a Hilbert space). The purpose of this nonlinear mapping is to turn a linearly inseparable data set in the input space into a linearly separable one in the high-dimensional feature space. An optimal separating hyperplane with maximum margin is then constructed in the feature space, which is equivalent to generating an optimal nonlinear decision boundary in the input space. Take a two-dimensional plane as an example, as shown in the following figures:
Figure 1. Schematic Diagram of SVM Nonlinear Mapping
Figure 2. Schematic diagram of optimal separation hyperplane
Two key problems must be solved when constructing an SVM. The first is to find a nonlinear map that sends the linearly inseparable data set to a linearly separable one in the high-dimensional feature space. The second is to find the optimal separating hyperplane in that space.
The first problem is solved with kernel methods. In this section, we used the radial basis function (RBF) kernel to map the amino acid sequence data into the high-dimensional feature space.
The RBF kernel:
\(K\left(x, x_{i}\right)=\exp \left(-\left\|x-x_{i}\right\|^{2} / 2 \sigma^{2}\right)\)
Optimization objective function:
\(J=W^{T} W=\|W\|^{2}\)
Constraints:
\(y_{j}\left[\sum_{i=1}^{N} \omega_{i} K\left(x_{j}, x_{i}\right)+b\right] \geq 1, \quad j=1, \cdots, N \)
N is the total number of amino acids (i.e., windows) in the data set, W is the adjustable weight vector of the SVM, \(x_{i}\) is the datum composed of five adjacent amino acids (a 5×20 = 100-bit code), and \(y_{i}\) is the secondary structure of the central amino acid of \(x_{i}\), defined as {1, -1} in a single binary classifier. The objective function ensures the optimality of the classification, and the constraints ensure its correctness. To improve classification accuracy, slack variables are introduced as follows:
\( J=\frac{1}{2} W^{T} W+C \sum_{i=1}^{N} \xi_{i} \)
\( y_{j}\left[\sum_{i=1}^{N} \omega_{i} K\left(x_{j}, x_{i}\right)+b\right] \geq 1-\xi_{j}, \quad j=1,\cdots, N \)
\( \xi_{j} \geq 0, \quad j=1, \cdots, N \)
Using Lagrangian duality and the KKT conditions, a constrained optimization problem with the same solution can be obtained as follows:
\( \underset{\alpha}{\min}\ \frac{1}{2} \sum_{i = 1}^{N} \sum_{j = 1}^{N} \alpha_{i} \alpha_{j} y_{i} y_{j} K\left(x_{i}, x_{j}\right)-\sum_{i = 1}^{N} \alpha_{i} \)
\( s.t. \sum_{i = 1}^{N} \alpha_{i} y_{i} = 0\)
\(0 \leqslant \alpha_{i} \leqslant C, \quad i = 1,2, \ldots, N\)
The optimum solution:
\(w^{*}=\sum_{i=1}^{N} \alpha_{i}^{*} y_{i} x_{i}, \quad b^{*}=y_{j}-\sum_{i=1}^{N} \alpha_{i}^{*} y_{i} K\left(x_{i}, x_{j}\right) \)
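Given the optimal \(\alpha^{*}\) and \(b^{*}\), a single binary classifier predicts with \(f(x)=\operatorname{sign}\left(\sum_{i} \alpha_{i}^{*} y_{i} K\left(x_{i}, x\right)+b^{*}\right)\). A minimal sketch of this decision function with the RBF kernel, using toy support vectors rather than a trained model:

```python
import math

# Sketch of the SVM decision function f(x) = sign(sum_i a_i y_i K(x_i, x) + b)
# with the RBF kernel defined above. The support vectors and multipliers
# below are toy values for illustration, not a trained model.

def rbf_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def decision(x, support_vectors, alphas, labels, b, sigma=1.0):
    """Binary SVM prediction in {1, -1}."""
    value = sum(a * y * rbf_kernel(sv, x, sigma)
                for sv, a, y in zip(support_vectors, alphas, labels)) + b
    return 1 if value >= 0 else -1

# Toy 2-point "model": one positive and one negative support vector.
svs = [(0.0, 0.0), (4.0, 4.0)]
alphas = [1.0, 1.0]
labels = [1, -1]
print(decision((0.5, 0.5), svs, alphas, labels, b=0.0))  # -> 1
```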
Solution of model
The solution of the SVM has thus been transformed into solving for the \(\alpha_{i}\), and we used the Sequential Minimal Optimization (SMO) algorithm to solve for \(\alpha\). The steps of the SMO algorithm are as follows:
Step 1: Take the initial \(\alpha ^ {0} = 0, k = 0\).
Step 2: Select \( \alpha_{1}^{k} \): choose a point that violates the following KKT condition:
\( 0<\alpha_{i}^{*}< C \Rightarrow y_{i} g\left(x_{i}\right)=1\)
Step 3: Select \(\alpha_{2}^{k}\): for the other points in the data set, calculate \(|E_{1}-E_{i}|\), and pick as \(\alpha_{2}^{k}\) the one that maximizes this value,
Where
\( E_{i}=\sum_{j=1}^{N} \alpha_{j}^{*} y_{j} K\left(x_{i}, x_{j}\right)+b-y_{i} \)
Step 4: Solve for the new unclipped \(\alpha_{2}^{\mathrm{new,unc}}\):
\(\alpha_{2}^{\mathrm{new,unc}}=\alpha_{2}^{k}+\frac{y_{2}\left(E_{1}-E_{2}\right)}{K_{11}+K_{22}-2 K_{12}}\)
Step 5: Calculate \(\alpha_{2}^{k + 1}\):
\( \alpha_{2}^{k+1}=\left\{\begin{array}{ll} H, & \alpha_{2}^{\mathrm{new,unc}}>H \\ \alpha_{2}^{\mathrm{new,unc}}, & L \leq \alpha_{2}^{\mathrm{new,unc}} \leq H \\ L, & \alpha_{2}^{\mathrm{new,unc}}< L \end{array}\right. \)
Where H and L comply with the following rules:
If \(y_{1} y_{2} = -1\):
\( L=\max \left(0, \alpha_{2}^{\text {old }}-\alpha_{1}^{\text {old }}\right) \quad H=\min \left(C, C+\alpha_{2}^{\text {old }}-\alpha_{1}^{\text {old }}\right) \)
If \(y_{1} y_{2} = 1\):
\( L=\max \left(0, \alpha_{2}^{\text {old }}+\alpha_{1}^{\text {old }}-C\right) \quad H=\min \left(C, \alpha_{2}^{\text {old }}+\alpha_{1}^{\text {old }}\right) \)
Step 6: Use the linear relationship between \(\alpha_{2}^{k+1} \) and \(\alpha_{1}^{k+1}\) to find \(\alpha_{1}^{k+1}\):
\( \alpha_{1}^{k+1} = \alpha_{1}^{k} + y_{1} y_{2}\left(\alpha_{2}^{k} - \alpha_{2}^{k+1}\right) \)
Step 7: Calculate \(b^{k+ 1} \) and \( E_{i}\):
\( b^{k+1}=y_{1}-\sum_{i=3}^{N} \alpha_{i} y_{i} K_{i 1}-\alpha_{1}^{k+1} y_{1} K_{11}-\alpha_{2}^{k+1} y_{2} K_{21}\)
\( E_{i}=\sum_{j \in S} y_{j} \alpha_{j} K\left(x_{i}, x_{j}\right)+b^{k+1}-y_{i} \), where S is the set of support vectors.
Step 8: Check whether the following termination conditions are met within the precision \(\varepsilon\):
\( \sum_{i=1}^{N} \alpha_{i} y_{i}=0 \)
\( 0 \leq \alpha_{i} \leq C, i=1,2 \ldots N \)
\( \alpha_{i}^{k+1}=0 \Rightarrow y_{i} g\left(x_{i}\right) \geq 1 \)
\( 0< \alpha_{i}^{k+1}< C \Rightarrow y_{i} g\left(x_{i}\right)=1 \)
\( \alpha_{i}^{k+1}=C \Rightarrow y_{i} g\left(x_{i}\right) \leq 1 \)
Step 9: If they are satisfied, stop and return \(\alpha^{k+1}\); otherwise go to Step 2.
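Steps 5 and 6 above (clipping \(\alpha_{2}\) to \([L, H]\) and updating \(\alpha_{1}\) along the constraint line) can be sketched as follows, with illustrative values rather than real data:

```python
# Sketch of the SMO clipping step (Steps 5-6): the unclipped update of
# alpha_2 is clipped to [L, H], then alpha_1 is updated so that
# sum(alpha_i * y_i) stays fixed. Inputs here are illustrative values.

def smo_clip_update(a1_old, a2_old, a2_unclipped, y1, y2, C):
    # Box bounds depend on whether the two labels agree (Step 5 rules).
    if y1 * y2 == -1:
        L = max(0.0, a2_old - a1_old)
        H = min(C, C + a2_old - a1_old)
    else:
        L = max(0.0, a2_old + a1_old - C)
        H = min(C, a2_old + a1_old)
    a2_new = min(max(a2_unclipped, L), H)          # clip to [L, H]
    a1_new = a1_old + y1 * y2 * (a2_old - a2_new)  # Step 6: linear relation
    return a1_new, a2_new

# y1*y2 = 1, so L = max(0, 0.5+0.25-1) = 0 and H = min(1, 0.75) = 0.75;
# the unclipped value 1.5 is clipped down to H = 0.75.
print(smo_clip_update(0.25, 0.5, 1.5, y1=1, y2=1, C=1.0))  # -> (0.0, 0.75)
```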
Result
MATLAB was employed to implement the SMO algorithm. A test set with known secondary structure was used to evaluate the model, and the test accuracy reached 99.24%. The detailed test results are in the file Test results.txt. We used this model to predict the secondary structure of a protein; the result is in the file Predicted result.xlsx.
Tertiary Structure Prediction
FXR+linker+ddRFPA1/ RXR+linker+ddRFPB1
Having obtained the amino acid sequences of FXR, RXR, the linker and ddRFP, the first step was to build the 3D structures of the proteins, so we turned to AlphaFold2. AlphaFold2, developed by the DeepMind team, is one of the most accurate three-dimensional protein structure prediction models to date and has revolutionized the entire field of biology. It greatly improves the accuracy of structure prediction by combining novel neural network architectures and training procedures based on evolutionary, physical and geometric constraints of protein structures.
Running AlphaFold2 on the amino acid sequences of FXR, the linker and the subunit ddRFPA1, we obtained five PDB models together with their quality metrics: the predicted aligned error (PAE) and the predicted local distance difference test score (pLDDT).
Figure 3. PAE of FXR+linker+ddRFP protein predicted by AlphaFold2
Figure 4. pLDDT of FXR+linker+ddRFP protein predicted by AlphaFold2
From the pLDDT figure, most regions score above 85, indicating that the prediction is reliable, while lower scores appear at both ends of the protein and at residues 245 and 460-480.
FXR is composed of a ligand-binding domain (LBD) and a DNA-binding domain (DBD). According to relevant literature, the LBD is located at positions 245 to 476 of the amino acid sequence. We deduced that there is a flexible region between the two domains, near residue 245, which lowers the score there. The 460-480 region is the site of the linker we used, and the linker is flexible. Strictly speaking, such a flexible region has no stable conformation, so it is expected that it cannot be predicted confidently. The termini of a protein also generally score low, so the low scores at both ends of our prediction are normal rather than a sign of inaccuracy.
In the PAE figure, the predicted aligned error for most residue pairs is within 0-10 Å, indicating that the predicted structure is reliable.
The figures above show that the five PDB models predicted by AlphaFold2 are almost identical, so we chose the highest-ranked model (model 1) for further study. The pLDDT plot of the rank-1 model shows its scores.
Figure 5. pLDDT of FXR+ddRFPA1 prediction model rank1
High-confidence intervals are painted blue, orange and yellow mark intermediate confidence, and red indicates low confidence. In our prediction, the model structures of FXR and ddRFPA1 are shown in blue, while the flexible 3*GS linker is colored red. Generally speaking, regions with low confidence correspond to intrinsically disordered regions (IDRs) of proteins, which are flexible. In our prediction, these red areas are exactly where the flexible linker is located, which is logical.
Similarly, we used the amino acid sequences of RXR, the linker and the ddRFP subunit to obtain five PDB models and their corresponding PAE and pLDDT plots. From the PAE figure, the predicted aligned error of the RXR+linker+ddRFPB1 structure is also within 0-10 Å, and as the pLDDT chart shows, this prediction is also relatively reliable.
Figure 6. PAE of RXR+linker+ddRFPB1 protein predicted by AlphaFold2
Figure 7. pLDDT of RXR+linker+ddRFPB1 protein predicted by AlphaFold2
Finally, the highest-ranked PDB file from the AlphaFold2 prediction was selected as the basis for the subsequent simulation and analysis.
Figure 8. FXR+linker+ddRFPA1
Figure 9. RXR+linker+ddRFPB1
FXR+CDCA
The protein structure changes after FXR binds CDCA. To clarify the tertiary structure of the protein after binding, we made a structural prediction of the post-binding complex.
We first used SwissDock to dock the small molecule CDCA with the protein FXR. SwissDock is a free online molecular docking tool developed by the Molecular Modeling Group at the Swiss Institute of Bioinformatics to simulate interactions between proteins and small molecules. Its docking algorithm, EADock, samples the dihedral space based on the properties of the target protein and the ligand. SwissDock reports two scores, an energy and a FullFitness value; for both, lower is better. The energy and FullFitness scores of the docking pose we obtained were 9.94587 and -1127.6123 respectively.
Figure 10. FXR+CDCA
This predicted docking protein provided theoretical guidance for our initial experiment. Click here for more.
CDCA+FXR+ddRFPA1+RXR+ddRFPB1
To explore the weak fluorescence intensity of ESloop2, we established a theoretical docking model. We performed docking simulations of CDCA+FXR+ddRFP-A1 against RXR+ddRFP-B1 using ZDOCK, a classic protein-protein docking program.
ZDOCK performs a full rigid-body search of docking orientations between two proteins. The current version, 3.0.2, includes performance optimization and a novel pairwise statistical energy potential. Using ZDOCK, we obtained the protein structure shown below.
Figure 11. Prediction of CDCA+FXR+ddRFPA1+RXR+ddRFPB1
Comparing the predicted docked protein with the experiment, we found some deviation in the binding positions of the two ddRFP subunits, which might explain the low fluorescence intensity observed in our experiment. Our prediction and simulation work was therefore helpful to the wet lab members.
Reference
Vapnik, Vladimir Naumovich. Statistical Learning Theory. Wiley, 1998.
Zhou, Zhihua. Machine Learning. Tsinghua University Press, 2016, pp. 121-139, 401-492.
Li, Hang. Statistical Learning Methods. Tsinghua University Press, 2012, pp. 95-135.
Kabsch, Wolfgang, and Christian Sander. "Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features." Biopolymers 22.12 (1983): 2577-2637.
Linlin Wu, and Shuo Xu. "Protein secondary structure prediction based on multi-SVM ensemble." Chinese Journal of BioInformatics, 3 (2010): 187-190.
Haoran Zhang, Zhengzhi Han, and Changgang Li. "Support Vector Machine." Computer Science, 12 (2002): 135-138.
Click to download Code and Result