Model

Modeling with AlphaFold and worm-like chain modeling

Modeling

The modeling for our project was divided into two major parts. The first part was the use of AlphaFold, an AI system predicting protein structure, to predict how the proteins in our project folds. The second part was worm-like chain (WLC) modeling. The WLC modeling was used for estimating optimal linker lengths between TEV halves and dCas9 proteins in the cell-free part of our project, which you can read more about here. Both modeling parts have been significant for understanding and predicting how our engineered systems would work.

AlphaFold

With the help of Johan Larsbrinck, associate professor at Chalmers University of Technology, AlphaFold[1] was used for predicting protein folding. AlphaFold is a software tool developed by the AI laboratory DeepMind, owned by Google. It uses an artificial neural network to accurately predict the structure of a protein in 3D by using the amino acid sequence of the protein.

dCas9 and TEV

The software was used to study dCas9-nTEVp and dCas9-cTEVp 3d structure since linkers and split proteins can cause folds in strange ways. Both these complexes showed unfavorable structures with the TEVp-halves placed in the middle of dCas9. For these two complexes unfortunately only the wrong order of proteins so nTEVp-dCas9 and cTEVp-dCas9 were tried instead of dCas9-nTEVp and dCas9-cTEVp.

nTEVp-dCas9

cTEVp-dCas9

TEVp is poorly soluble in water explaining why it was found in the hydrophobic center of dCas9 [2]. To combat this several mutations of TEVp [3] and another split protease, HRV 3C,[4] were tried. Only the n parts of TEVp were tried while both HRV 3C are shown.

dCas9-nSuMMV

dCas9-nSbMV

dCas9-nPPvP

dCas9-nHRV 3C

dCas9-cHRV 3C

The most probable predictions of all of these showed good results with separate proteins justifying a change to one of these should be made. HRV 3C cleaves another sequence then TEVp and leaves a longer tail in the form Gly-Pro. Thus the nSuMMV, nSbMV or nPPvP should be tried first in further lab experiments.

iGal

iGal was also studied, which showed firstly that the linker is easily accessible for cleavage. Secondly, the alpha (pink and purple) and omega (red) fragments could be placed as in the active enzyme; the pink region is the same sequence of amino acids in both iGal and B-gal. This could imply that iGal would always be active or instead of only being active after it gets cleaved. Below is iGal to the left and a normal beta galactosidase to the right. The yellow region is the cutting site for TEVp.


iGal (left) compared to B-gal [PDB:1F4H] (right).

iGal shall not be bound in its active form like it seems to be here before it is cleaved. Thus the linker in iGal should be shortened.

Worm-like chain modeling

Worm-like chain (WLC) modeling is used in polymer physics for modeling how polymers bend. In our project one program was written for simulating any polymer in 3D using WLC, and another program for simulating a specific system in the cell-free part of our project.

We have been in contact with the Tokyo team for collaboration, and they found that they could use our WLC modeling for analyzing a part of one of their systems. You can read more about the modeling collaboration on our partnership page here, and on Tokyo's page here.

This section will be divided into five main parts. First there is an introduction and the purpose of our modeling is explained. Secondly there is a description of how the modeling programs work and then how our modeling programs were used. Then the accuracy of the modeling is discussed. For those interested in the mathematics behind our models there is a theory section, and lastly the modeling method and possible future development is discussed.

To make it possible for others to use our WLC modeling, we have also designed a web app called PLOP. The main function of the web app is polymer length optimization. You can read more about it here. The source code for PLOP, and the modeling programs described on this page, can be found here.

Introduction

You can model how a polymer bends by modeling a polymer as a chain with small segments. For this to work the polymer has to be semi-flexible, which means that two successive segments will have approximately the same direction. For WLC modeling, the segment lengths are most commonly modeled with infinitesimal length [5].

All polymers do not bend in the same way. Some are stiff and others are flexible. The stiffness of a polymer will affect how much the polymer bend where one segment ends and the next begins. The more stiff a polymer is, the more likely it is that two successive segments will have very similar directions, which results in a less wrinkly polymer.

Some examples of semi-flexible polymers are DNA segments and amino acid sequences. Protein folding normally depends on within-protein bonds which makes WLC modeling not suitable. Protein linkers can however be modeled with WLC modeling, as protein linkers are normally intrinsically disordered which means that the linkers bend randomly.

When modeling a polymer as a chain, segment lengths are not related to parts of the polymer. For example when protein linkers are modeled the segment lengths are not related to amino acid lengths.

Aim

In our project the main use of WLC modeling was to examine how linker lengths affect the probability that two proteins come in contact. The proteins were TEV halves that were used in the cell-free method in our project, which you can read more about here. The goal was to find the linker length that would result in the highest possible contact probability, which in turn will likely yield the highest possible reaction rate of the proteins. For point-of-care diagnostic tests this is desirable as high reaction rate would lead to faster test results.

Programs

All modeling was done in Matlab. You can find all the code here. For those interested in using our modeling, we suggest first checking the web app. If you want to adapt our modeling for your purposes, I suggest using our programs written in Matlab instead. All programs are in the form of Matlab scripts.

WLC for endpoint distance

One program for WLC modeling was made for general purposes, it can be used for modeling any polymer. Hereafter it will be called the general purpose program. The user specifies a number of parameters for the polymer and run the program. The program then simulates different folds for the polymer. One output of the program plots a graph in 3D with one of the simulated folds. This can be used for confirming that the modeling works properly, as well as visually checking how much the polymer will bend.

The simulations are also used for calculating an approximation of how the distance between the polymer endpoints will be distributed. It generates and plots an estimation of the probability density function for endpoint distance.

TEV halves contact probability

For modeling the probability of TEV halves coming in contact, another program was written. Hereafter it will be called the TEV contact program. This program was based on the general purpose program, with some parts added. The program simulates a system. One part of the system is a DNA segment that consists of 30 base pairs. The DNA segment is modeled as a straight vector. The reason for this is that this is a rather good approximation for a DNA segment that only has about 30 base pairs. This can be shown with the general purpose program, and you can read more about that below.

There are binding sites for dCas9 associated with guide RNA at the ends of the DNA segment. One dCas9 is bound at each binding site. The dCas9 are modeled as straight vectors from the ends of the DNA segment to the dCas9 terminals connected to protein linkers. The two linkers bind the dCas9 proteins together with TEV halves. See the images below for clarification. They show a simulation of the system viewed from two different angles.

The TEV halves attached to each of the polymer ends are not shown in the image, but are modeled as spheres attached at the endpoints. Their radiuses are approximated as 1 nm. The probability that TEV halves come in contact is calculated by simulating the protein linkers a high number of times, and counting the number of times the proteins come close. The model assumes that the TEV halves are in contact if the distance between their surfaces are below 0.5 nm.

How the programs were used

If you have not yet read about the cell-free part of MOD3, we redirect you to the description page before continuing.

We used the general purpose program to model DNA. The DNA strand between binding sites for two dCas9 molecules are around 30 base pairs. When modeling a DNA strand with that number of base pairs our program showed that it will hardly bend at all, it will look like an almost straight line. See the image below.


The program's first output: a simulated DNA strand with 30 base pairs

DNA is stiff in comparison with proteins. There have however been some disagreement on how stiff DNA is [1]. For the results shown here we have used a stiffness that is close to the lowest estimations of DNA:s stiffness. The figure below shows the estimated probability density function of the DNA strand's end-to-end distance.


Estimation of probability density function to a DNA strand end-to-end distance with 30 base pairs

The result that the DNA strand is very straight was used when designing the TEV contact program. Modeling the DNA strand as a chain would introduce unnecessary complexity to the program as it is almost straight, so instead DNA is modeled as a straight line. It has the same length as the DNA strand's contour length. This implies that the DNA line has a slightly longer end-to-end distance than the expected end-to-end distance of the DNA.

The results of the TEV contact program is shown in the figure below. One of the outputs of the program is the probability of TEV halves being in contact. The figure shows probability estimates obtained by running the program with different linker lengths. The program uses 106 simulations for each linker length, and models the linkers as chains with 50 segments.


Probabilities of TEV halves being in contact for different linker lengths

From the graph we see that the model suggests that the optimal linker lengths are about 16-19 nm in contour length, with high contact probability for longer linkers as well. The contour length per amino acid varies from protein to protein, and has been measured to be 0.34-0.4 nm [6]. For a protein linker with length 16 nm, and contour length of 0.4 nm per amino acid, there are 16/0.4 = 40 amino acids. For a protein linker with length 19 nm, and contour length of 0.34 nm per amino acid, there are 19/0.34 ≈ 56 amino acids. An interpretation of these calculations is that the optimal number of amino acids is somewhere between 40 and 56 amino acids according to the model.

According to another source the average contour length per amino acid is 0.4 nm [6]. This would lead to the optimal number of amino acids being between 16/0.4 = 40 amino acids and 19/0.4 = 47.5 amino acids, and the results would be approximately the same as long as the number of amino acids are within that range.

There are however problems with choosing long linkers. The linker lengths suggested by the model are long in comparison with the length of the DNA segment modeled in the TEV contact program, which is approximately 10.2 nm. The linkers could get entangled with the proteins and DNA segment. The implications of the model might be that the linkers should be as long as possible, as long as they are short enough that they will not be entangled too much. Another conclusion is that it is likely better if the linkers are constructed with amino acids that make them stiffer. Stiffness of a polymer is determined by its persistence length. The persistence length of our linkers were 0.45 nm. A higher persistence length results in higher stiffness. With a persistence length of 0.6 nm for example, the optimal linker length would be reduced to about 11-13 nm, as shown in the figure below.


Probabilities of TEV halves being in contact for different linker lengths, with persistence length 0.6

Evaluation of modeling accuracy

The results above are only informative if the model is accurate. However, for evaluating the accuracy of our model empirical data is needed. For accuracy regarding a specific type of polymer, the reaction rates of protein-protein reactions in a system with that polymer, with several polymer lengths is needed. Unfortunately experimenting with different protein linker lengths in wet lab was beyond the scope of our project. Further experiments, or an extensive analysis of research done by others regarding polymer length optimization, is required for a good evaluation of the model's accuracy.

However, the accuracy of WLC modeling in general has been evaluated. In one study it was successfully used for modeling protein linkers with high accuracy [8]. In the study WLC modeling was used for computing the probability distribution for end-to-end distance of a polymer. In the study it was also used for simulating the chains for numerically approximating probability distributions. The simulations appear however to have been done in 2D, which suggest that our simulations made in 3D might give more accurate results. An important difference is that they adapted the stiffness of their modeled linkers for fitting experimental data, while in the modeling for this project the stiffness is based on the scientific measurements. This means that in the study they could regulate the modeling for optimally fitting the data, and the actual stiffness of their polymers might not be the stiffness that fit their data well. In our project it is assumed that the actual stiffness of the polymers will result in accurate simulations using WLC modeling.

Theory

The theory behind our modeling is divided in three parts. First there is the physical and mathematical background of WLC modeling. Then there are some explanations for the code. Lastly, background information is discussed.

WLC modeling

WLC modeling is based on how much energy it takes to bend polymers. For two connected chain segments, a larger angle between them corresponds to higher bending energy, which results in higher angles having lower probability. Under the assumption that there is no force on the chain, the energy for the whole chain can be expressed as $$E_N(\theta) = \frac{k_BTl_p}{2}\frac{L}{N}\sum_{n=1}^{N-1}(\frac{\theta_{n}}{L/N})^2$$ [9]. kB is Boltzmann's constant, T is the temperature in Kelvin, and lp is called persistence length. The value of the persistence length is dependent on the type of polymer. L is the contour length of the chain, and N is the number of chain segments. L/N is therefore the length of each segment. θn is the angle between segment n and n+1 in the chain, and θ is a vector of all θn values.

The theory of how you get the probability of a certain folding of the chain is derived from thermodynamics and statistical mechanics. According to statistical mechanics there are a finite number of possible foldings, otherwise the probability of any folding would be zero and a density function would be used instead. Given the energy EN(θ), we have $$P(\theta) \propto exp[-\frac{E_N(\theta)}{k_BT}] = exp[-\frac{l_p}{2}\frac{L}{N}\sum_{n=1}^{N-1}(\frac{\theta_{n}}{L/N})^2].$$ To model an example folding of a chain, you need to know the distribution for a single angle. To get the distribution, two connected segments are viewed as a separate chain. We then use the above formula, and get $$P(\theta) \propto exp[-\frac{l_p}{2}\frac{L}{N}(\frac{\theta}{L/N})^2].$$ There are significant similarities between P(θ) and the probability density of a normal distribution. The density of a normal distribution is described by the equation $$f(x) = \frac{1}{\sigma \sqrt{2\pi}}exp[-\frac{1}{2}(\frac{x-\mu}{\sigma})^2].$$ One of the assumptions in our modeling is that the chain angles can be generated from a density function. If we assume that θ is normally distributed, then x = θ in the expression above. μ is the expected value of θ. If negative values for θ are allowed, there will be an equal chance of θ being positive as negative, which means that the expected value of θ is zero, in other words μ = 0.

From the above explanation we get $$-\frac{1}{2}(\frac{x-\mu}{\sigma})^2 = -\frac{1}{2}(\frac{\theta}{\sigma})^2.$$ We also get $$exp[-\frac{l_p}{2}\frac{L}{N}(\frac{\theta}{L/N})^2] = \frac{1}{\sigma \sqrt{2\pi}}exp[-\frac{1}{2}(\frac{x-\mu}{\sigma})^2] \iff -\frac{l_p}{2}\frac{L}{N}(\frac{\theta}{L/N})^2 = -\frac{1}{2}(\frac{\theta}{\sigma})^2 \iff l_p\frac{N}{L} = (\frac{1}{\sigma})^2 \iff \sigma = \sqrt{\frac{L}{Nl_p}} .$$ The last equivalence is true as σ is positive for a normal distribution. As μ and σ are known, the distribution of θ can be expressed as N(μ, σ2) = N(0, L/Nlp).

The persistence length decides stiffness of a polymer. A longer persistence length corresponds to a higher stiffness and less bending. The persistence length of proteins is dependent on the amino acid sequence. The linkers binding TEV halves to dCas9 proteins have a persistence length of approximately 0.45 nm, based on the result of a study where similar linkers were examined [10]. As mentioned before, there are some debate about the stiffness of DNA. Different studies have gotten different results when estimating the persistence length of DNA. One of the lower estimates is 39 ± 3 nm [1]. For our modeling results we used persistence length 39 nm.

Mathematics

The angle θ between two segments describe how much the chain folds at the corresponding joint, but not in which direction. For the WLC modeling it is assumed that the direction is completely random. The length and direction of one segment can be described by the vector u = (u1, u2, u3). When generating the next segment in a chain we start by making it equal to the previous segment, described by u. Then we make a vector that is orthogonal to the previous vector and normalize it, $$n = \frac{1}{|| (1,-u_1/u_2,0) ||}(1,-u_1/u_2,0).$$ Note that u2 can not be zero.

It can be shown that n is orthogonal to u by taking the scalar product of the vectors and showing that it is zero. $$n \cdot u = \frac{1}{|| (1,-u_1/u_2,0) ||}(1,-u_1/u_2,0) \cdot (u_1, u_2, u_3) = \frac{1}{|| (1,-u_1/u_2,0) ||} (-u_1 + u_1) = 0.$$ Now we take the vector u and rotate it around n using Rodrigue's formula. Rodrigue's formula can be used if n is a unit vector, which is the reason it was normalized. In the matrix notation of Rodrigue's formula, a rotated vector can be calculated as urot,1 = R1u. R1 is called the rotation matrix and is defined as $$R_1 = I + \sin(\theta)K_1 + (1-\cos(\theta))K_1K_1 $$ [11] θ is generated from the normal distribution N(0, L/Nlp), and

When we have urot,1, we rotate the vector around u with an angle uniformly distributed on the interval (-π,π), as the direction the segment is bent towards should be completely random. For this we normalize u and use the matrix notation of Rodrigue's formula once more. Let $$u' = \frac{1}{|| u ||}u.$$ We then get the rotation of urot,1 as urot,2 = R2urot,1, where $$R_2 = I + \sin(\theta)K_2 + (1-\cos(\theta))K_2 K_2,$$

Statistics

The programs used for modeling make a large number of simulations. The statistical method for our modeling falls under the category bootstrapping. Bootstrapping is any test or metric using sampling with replacement. This is, as in our case, often done by using the same sampling method for generating a high number of samples. Our samples are used to create histograms that are then scaled to get estimations of probability densities.

Other background information

The following information have been used to make our modeling more accurate. When modeling the TEV completion system, how much DNA rotates per base pair was taken into account. When DNA is packed in chromatin, which it is in eukaryotic cells, there are 10.5 base pairs per turn [12]. That means it rotates with 2π/10.5 ≈ 0.6 radians per base pair.

In the cell-free part of the project two dCas9 molecules bind to DNA. Theoretically the system would work better if the dCas9 molecules are positioned on the same side of the DNA. In other words it would be optimal if the DNA rotates an even number of turns, which is accomplished if the number of base pairs in the DNA segment between the dCas9 molecules are k • 10.5, where k = 1,2,3… .For our project we chose to have DNA segments with approximately 30 base pairs. For computing our modeling results we therefore chose to optimize linker length for a length of 30 base pairs. It might have been better to have 31 or 32 base pairs, as 3 • 10.5 = 31.5 base pairs corresponds to an even number of turns.

Another important detail is the distance between proteins in contact. When two proteins interact it is called a protein-protein interaction, PPI. A configuration of two interacting proteins has a contact zone. The contact zone consists of the set of amino acids from the proteins that are within interaction distance. The interaction distance is usually spanning from 3-5 Å [12]. For our modeling we have, based on that information, assumed that the TEV halves are in contact if their surfaces are within 5 Å.

Summary of assumptions

There are some assumptions regarding any WLC modeling. The WLC model assumes that a modeled polymer is inextensible and has a linear elastic bending energy. Linear elastic bending energy means that the energy required for bending the polymer is proportional to how much the polymer is bent. It is also assumed that the polymer is subjected to thermal fluctuations [5]. The polymer is also modeled without volume, which means that a simulated polymer is modeled as if it could overlap itself. It is therefore assumed that this does not affect accuracy significantly. Some suggest that accuracy of WLC modeling is worse for a long and flexible polymer, as it would then overlap itself more [8].

For our modeling we have assumed that there is no force on any polymer. We have also assumed that a polymer can be modeled with good precision if the segment lengths are sufficiently small. As described in the theory section, the model assumes that the angles between chain segments can be generated from a Normal distribution. As WLC modeling assumes that polymers can overlap themselves, it is similarly assumed that proteins can be inside of one another, and polymers can overlap with proteins and DNA. In the TEV contact program for example, the spheres representing TEV halves can overlap.

In the TEV contact program, the two linkers are modeled with equal length. It is assumed that the highest contact probability of TEV halves will be achieved with equal linker lengths.

Other assumptions are that modeling proteins as spheres will not affect accuracy significantly, and that high contact probability will result in a high reaction rate even if protein's relative orientations are not considered.

Discussion and future modeling

There are some major advantages with our modeling method compared to other existing methods. In WLC modeling you normally do not model an entire path of a polymer. One common use of the WLC model is to calculate mean squared end-to-end distance mathematically [5]. Our WLC modeling on the other hand enables visualizing the results. Another important advantage of our modeling is that it enables future development where impossible paths are removed. You can exclude paths that are not physically possible because it goes through for example a protein or a membrane surface. This might also make it possible to model polymers with thickness, as we could ensure that the polymer does not overlap itself. Our modeling also makes it possible to get direction of a chain at its endpoints. In a future extension of our TEV contact program this could be used for calculating the probability that the TEV halves come into contact with correct relative orientations for the TEV halves to form a functional TEV protease.

Another extension of our TEV contact program would be to enable testing different linker lengths for each TEV half. It would then not be necessary to assume that equal linker lengths will result in optimal contact probability.

References

[1]: Gross P, Laurens N, Oddershede L, Bockelmann U, Peterman E, Wuite G. Quantifying how DNA stretches, melts and changes twist under tension. Nature Physics. 2011;7(9):731-736.

[2]: Nam H, Hwang BJ, Choi D-Y, Shin S, Choi M. Tobacco etch virus (TEV) protease with multiple mutations to improve solubility and reduce self-cleavage exhibits enhanced enzymatic activity [Internet]. FEBS open bio. John Wiley and Sons Inc.; 2020 [cited 2022Oct12]. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7137792/

[3]: Fink, T., Lonzarić, J., Praznik, A. et al. Design of fast proteolysis-based signaling and logic circuits in mammalian cells. Nat Chem Biol 15, 115–122 (2019). https://doi.org/10.1038/s41589-018-0181-6

[4]: Wang, S., Zhang, F., Mei, M. et al. A split protease-E. coli ClpXP system quantifies protein–protein interactions in Escherichia coli cells. Commun Biol 4, 841 (2021). https://doi.org/10.1038/s42003-021-02374-w

[5]: Milstein JN, Meiners J-C. Worm-like Chain (WLC) model. Encyclopedia of Biophysics. 2013;:2757–60.

[6]: M U. Contour length of amino acid - Generic - BNID 114332 [Internet]. Bionumbers.hms.harvard.edu. 2022 [cited 11 October 2022]. Available from: https://bionumbers.hms.harvard.edu/bionumber.aspx?s=n&v=5&id=114332

[7]: Ainavarapu S, Brujić J, Huang H, Wiita A, Lu H, Li L et al. Contour Length and Refolding Rate of a Small Protein Controlled by Engineered Disulfide Bonds. Biophysical Journal. 2007;92(1):225-233.

[8]: Lapidus L, Steinbach P, Eaton W, Szabo A, Hofrichter J. Effects of Chain Stiffness on the Dynamics of Loop Formation in Polypeptides. Appendix:  Testing a 1-Dimensional Diffusion Model for Peptide Dynamics. The Journal of Physical Chemistry B. 2002;106(44):11628-11640.

[9]: Marantan A, Mahadevan L. Mechanics and statistics of the worm-like Chain. American Journal of Physics. 2018;86(2):86–94.

[10]: Evers T, van Dongen E, Faesen A, Meijer E, Merkx M. Quantitative Understanding of the Energy Transfer between Fluorescent Proteins Connected via Flexible Peptide Linkers. Biochemistry. 2006;45(44):13183-13192.

[11]: En.wikipedia.org. 2022. Rodrigues' rotation formula - Wikipedia. [online] Available at: [Accessed 11 October 2022].

[12]: Levitt M. How many base-pairs per turn does DNA have in solution and in chromatin? Some theoretical calculations. Proceedings of the National Academy of Sciences. 1978;75(2):640-644.

[13]: Furmanová K, Byška J, Gröller E, Viola I, Paleček J, Kozlíková B. COZOID: contact zone identifier for visual analysis of protein-protein interactions. BMC Bioinformatics. 2018;19(1).