Modeling with AlphaFold and worm-like chain modeling
The modeling for our project was divided into two major parts. The first part was the use of AlphaFold, an AI
system predicting protein structure, to predict how the proteins in our project folds. The second part was
worm-like chain (WLC) modeling. The WLC modeling was used for estimating optimal linker lengths between TEV
halves and dCas9 proteins in the cell-free part of our project, which you can read more about here. Both modeling parts have been
significant for understanding and predicting how our engineered systems would work.
With the help of Johan Larsbrinck, associate professor at Chalmers University of Technology, AlphaFold[1] was
used for predicting protein folding. AlphaFold is a software tool developed by the AI laboratory DeepMind, owned
by Google. It uses an artificial neural network to accurately predict the structure of a protein in 3D by using
the amino acid sequence of the protein.
The software was used to study dCas9-nTEVp and dCas9-cTEVp 3d structure since linkers and split proteins can
cause folds in strange ways. Both these complexes showed unfavorable structures with the TEVp-halves placed in
the middle of dCas9. For these two complexes unfortunately only the wrong order of proteins so nTEVp-dCas9 and
cTEVp-dCas9 were tried instead of dCas9-nTEVp and dCas9-cTEVp.
nTEVp-dCas9
cTEVp-dCas9
TEVp is poorly soluble in water explaining why it was found in the hydrophobic center of dCas9 [2]. To combat
this several mutations of TEVp [3] and another split protease, HRV 3C,[4] were tried. Only the n parts of TEVp
were tried while both HRV 3C are shown.
dCas9-nSuMMV
dCas9-nSbMV
dCas9-nPPvP
dCas9-nHRV 3C
dCas9-cHRV 3C
The most probable predictions of all of these showed good results with separate proteins justifying a change to
one of these should be made. HRV 3C cleaves another sequence then TEVp and leaves a longer tail in the form
Gly-Pro. Thus the nSuMMV, nSbMV or nPPvP should be tried first in further lab experiments.
iGal was also studied, which showed firstly that the linker is easily accessible for cleavage. Secondly, the
alpha (pink and purple) and omega (red) fragments could be placed as in the active enzyme; the pink region is the same sequence of amino acids in both iGal and B-gal. This could imply that iGal would always be active or instead of only being active after it gets cleaved. Below is iGal to the left and a normal beta galactosidase to the right. The yellow region is the cutting site for TEVp.
iGal shall not be bound in its active form like it seems to be here before it is cleaved. Thus the linker in
iGal should be shortened.
Worm-like chain (WLC) modeling is used in polymer physics for modeling how polymers bend. In our project one
program was written for simulating any polymer in 3D using WLC, and another program for simulating a specific
system in the cell-free part of our project.
We have been in contact with the Tokyo team for collaboration, and they found that they could use our WLC modeling for analyzing a part of one of their systems. You can read more about the modeling collaboration on our partnership page
here, and on Tokyo's page here.
This section will be divided into five main parts. First there is an introduction and the purpose of our modeling is explained. Secondly there is a description of how the modeling programs work and then how our modeling programs were used. Then the accuracy of the modeling is discussed. For those interested in the mathematics behind our models there is a theory section, and lastly the modeling method and possible future development is discussed.
To make it possible for others to use our WLC modeling, we have also designed a web app called PLOP. The main function of the web app is polymer length optimization. You can read more about it
here. The source code for PLOP, and the
modeling programs described on this page, can be found here.
You can model how a polymer bends by modeling a polymer as a chain with small segments. For this to work the
polymer has to be semi-flexible, which means that two successive segments will have approximately the same
direction. For WLC modeling, the segment lengths are most commonly modeled with infinitesimal length [5].
All polymers do not bend in the same way. Some are stiff and others are flexible. The stiffness of a polymer will affect how much the polymer bend where one segment ends and the next begins. The more stiff a polymer is, the more likely it is that two successive segments will have very similar directions, which results in a less wrinkly polymer.
Some examples of semi-flexible polymers are DNA segments and amino acid sequences. Protein folding normally depends on within-protein bonds which makes WLC modeling not suitable. Protein linkers can however be modeled with WLC modeling, as protein linkers are normally intrinsically disordered which means that the linkers bend randomly.
When modeling a polymer as a chain, segment lengths are not related to parts of the polymer. For example when protein linkers are modeled the segment lengths are not related to amino acid lengths.
In our project the main use of WLC modeling was to examine how linker lengths affect the probability that two
proteins come in contact. The proteins were TEV halves that were used in the cell-free method in our project,
which you can read more about here. The
goal was to find the linker length that would result in the highest possible contact probability, which in turn
will likely yield the highest possible reaction rate of the proteins. For point-of-care diagnostic tests this is
desirable as high reaction rate would lead to faster test results.
All modeling was done in Matlab. You can find all the code here.
For those interested in using our modeling, we suggest first checking the web app. If you want to adapt our modeling for
your purposes, I suggest using our programs written in Matlab instead. All programs are in the form of Matlab
scripts.
One program for WLC modeling was made for general purposes, it can be used for modeling any polymer. Hereafter
it will be called the general purpose program. The user specifies a number of parameters for the polymer and run
the program. The program then simulates different folds for the polymer. One output of the program plots a graph
in 3D with one of the simulated folds. This can be used for confirming that the modeling works properly, as well
as visually checking how much the polymer will bend.
The simulations are also used for calculating an approximation of how the distance between the polymer endpoints will be distributed. It generates and plots an estimation of the probability density function for endpoint distance.
For modeling the probability of TEV halves coming in contact, another program was written. Hereafter it will be
called the TEV contact program. This program was based on the general purpose program, with some parts added.
The program simulates a system. One part of the system is a DNA segment that consists of 30 base pairs. The DNA
segment is modeled as a straight vector. The reason for this is that this is a rather good approximation for a
DNA segment that only has about 30 base pairs. This can be shown with the general purpose program, and you can
read more about that below.
There are binding sites for dCas9 associated with guide RNA at the ends of the DNA segment. One dCas9 is bound at each binding site. The dCas9 are modeled as straight vectors from the ends of the DNA segment to the dCas9 terminals connected to protein linkers. The two linkers bind the dCas9 proteins together with TEV halves. See the images below for clarification. They show a simulation of the system viewed from two different angles.
If you have not yet read about the cell-free part of MOD3, we redirect you to the description page before continuing.
We used the general purpose program to model DNA. The DNA strand between binding sites for two dCas9 molecules are around 30 base pairs. When modeling a DNA strand with that number of base pairs our program showed that it will hardly bend at all, it will look like an almost straight line. See the image below.
The program's first output: a simulated DNA strand with 30 base pairs
Estimation of probability density function to a DNA strand end-to-end distance with 30 base pairs
The result that the DNA strand is very straight was used when designing the TEV contact program. Modeling the
DNA strand as a chain would introduce unnecessary complexity to the program as it is almost straight, so instead
DNA is modeled as a straight line. It has the same length as the DNA strand's contour length. This implies that
the DNA line has a slightly longer end-to-end distance than the expected end-to-end distance of the DNA.
The results of the TEV contact program is shown in the figure below. One of the outputs of the program is the probability of TEV halves being in contact. The figure shows probability estimates obtained by running the program with different linker lengths. The program uses 106
simulations for each linker length, and models the linkers as chains with 50 segments.
Probabilities of TEV halves being in contact for different linker lengths
From the graph we see that the model suggests that the optimal linker lengths are about 16-19 nm in
contour
length, with high contact probability for longer linkers as well. The contour length per amino acid varies from
protein to protein, and has been measured to be 0.34-0.4 nm [6]. For a protein linker with length 16
nm, and contour length of 0.4 nm per amino acid, there are 16/0.4 = 40 amino acids. For a protein
linker with length 19 nm, and contour length of 0.34 nm per amino acid, there are 19/0.34 ≈
56
amino acids. An interpretation of these calculations is that the optimal number of amino acids is somewhere
between 40 and 56 amino acids according to the model.
According to another source the average contour length per amino acid is 0.4 nm [6]. This would lead to the optimal number of amino acids being between 16/0.4 = 40 amino acids and 19/0.4 = 47.5 amino acids, and the results would be approximately the same as long as the number of amino acids are within that range.
There are however problems with choosing long linkers. The linker lengths suggested by the model are long in comparison with the length of the DNA segment modeled in the TEV contact program, which is approximately 10.2
nm. The linkers could get entangled with the proteins and DNA segment. The implications of the model
might
be that the linkers should be as long as possible, as long as they are short enough that they will not be
entangled too much. Another conclusion is that it is likely better if the linkers are constructed with amino
acids
that make them stiffer. Stiffness of a polymer is determined by its persistence length. The persistence length
of
our linkers were 0.45 nm. A higher persistence length results in higher stiffness. With a persistence
length of 0.6 nm for example, the optimal linker length would be reduced to about 11-13 nm, as
shown in the figure below.
Probabilities of TEV halves being in contact for different linker lengths, with persistence length 0.6
The results above are only informative if the model is accurate. However, for evaluating the accuracy of our
model empirical data is needed. For accuracy regarding a specific type of polymer, the reaction rates of
protein-protein reactions in a system with that polymer, with several polymer lengths is needed. Unfortunately
experimenting with different protein linker lengths in wet lab was beyond the scope of our project. Further
experiments, or an extensive analysis of research done by others regarding polymer length optimization, is
required for a good evaluation of the model's accuracy.
However, the accuracy of WLC modeling in general has been evaluated. In one study it was successfully used for modeling protein linkers with high accuracy [8]. In the study WLC modeling was used for computing the probability distribution for end-to-end distance of a polymer. In the study it was also used for simulating the chains for numerically approximating probability distributions. The simulations appear however to have been done in 2D, which suggest that our simulations made in 3D might give more accurate results. An important difference is that they adapted the stiffness of their modeled linkers for fitting experimental data, while in the modeling for this project the stiffness is based on the scientific measurements. This means that in the study they could regulate the modeling for optimally fitting the data, and the actual stiffness of their polymers might not be the stiffness that fit their data well. In our project it is assumed that the actual stiffness of the polymers will result in accurate simulations using WLC modeling.
The theory behind our modeling is divided in three parts. First there is the physical and mathematical
background of WLC modeling. Then there are some explanations for the code. Lastly, background information is
discussed.
WLC modeling is based on how much energy it takes to bend polymers. For two connected chain segments, a larger
angle between them corresponds to higher bending energy, which results in higher angles having lower
probability. Under the assumption that there is no force on the chain, the energy for the whole chain can be
expressed as $$E_N(\theta) = \frac{k_BTl_p}{2}\frac{L}{N}\sum_{n=1}^{N-1}(\frac{\theta_{n}}{L/N})^2$$ [9].
kB is Boltzmann's constant, T is the temperature in Kelvin, and lp is
called persistence length. The value of the persistence length is dependent on the type of polymer. L is
the contour length of the chain, and N is the number of chain segments. L/N is therefore the
length of each segment. θn is the angle between segment n and n+1 in the
chain, and θ is a vector of all θn values.
The theory of how you get the probability of a certain folding of the chain is derived from thermodynamics and statistical mechanics. According to statistical mechanics there are a finite number of possible foldings, otherwise the probability of any folding would be zero and a density function would be used instead. Given the energy
EN(θ), we have $$P(\theta) \propto exp[-\frac{E_N(\theta)}{k_BT}] =
exp[-\frac{l_p}{2}\frac{L}{N}\sum_{n=1}^{N-1}(\frac{\theta_{n}}{L/N})^2].$$
To model an example folding of a chain, you need to know the distribution for a single angle. To get the
distribution, two connected segments are viewed as a separate chain. We then use the above formula, and get
$$P(\theta) \propto exp[-\frac{l_p}{2}\frac{L}{N}(\frac{\theta}{L/N})^2].$$
There are significant similarities between P(θ) and the probability density of a normal
distribution. The
density of a normal distribution is described by the equation
$$f(x) = \frac{1}{\sigma \sqrt{2\pi}}exp[-\frac{1}{2}(\frac{x-\mu}{\sigma})^2].$$
One of the assumptions in our modeling is that the chain angles can be generated from a density function. If we
assume that θ is normally distributed, then x = θ in the expression above.
μ is the expected value of θ. If negative values for θ are allowed, there
will be an equal chance of θ being positive as negative, which means that the expected value of
θ is zero, in other words μ = 0.
From the above explanation we get
$$-\frac{1}{2}(\frac{x-\mu}{\sigma})^2 = -\frac{1}{2}(\frac{\theta}{\sigma})^2.$$
We also get
$$exp[-\frac{l_p}{2}\frac{L}{N}(\frac{\theta}{L/N})^2] = \frac{1}{\sigma \sqrt{2\pi}}exp[-\frac{1}{2}(\frac{x-\mu}{\sigma})^2] \iff -\frac{l_p}{2}\frac{L}{N}(\frac{\theta}{L/N})^2 = -\frac{1}{2}(\frac{\theta}{\sigma})^2 \iff l_p\frac{N}{L} = (\frac{1}{\sigma})^2 \iff \sigma = \sqrt{\frac{L}{Nl_p}} .$$
The last equivalence is true as σ is positive for a normal distribution. As μ and σ
are known, the distribution of θ can be expressed as
N(μ, σ2) = N(0, L/Nlp).
The persistence length decides stiffness of a polymer. A longer persistence length corresponds to a higher stiffness and less bending. The persistence length of proteins is dependent on the amino acid sequence. The linkers binding TEV halves to dCas9 proteins have a persistence length of approximately 0.45
nm, based on the result of a study where similar linkers were examined [10]. As mentioned before, there
are some debate about the stiffness of DNA. Different studies have gotten different results when estimating the
persistence length of DNA. One of the lower estimates is 39 ± 3 nm [1]. For our modeling results we
used persistence length 39 nm.
The angle θ between two segments describe how much the chain folds at the corresponding joint, but
not in which direction. For the WLC modeling it is assumed that the direction is completely random. The length
and direction of one segment can be described by the vector u =
(u1, u2, u3). When generating the next segment in a chain we start by
making it equal to the previous segment, described by u. Then we make a vector that is orthogonal
to the previous vector and normalize it,
$$n = \frac{1}{|| (1,-u_1/u_2,0) ||}(1,-u_1/u_2,0).$$
Note that u2 can not be zero.
It can be shown that n is orthogonal to u by taking the scalar product of the vectors and
showing that it is zero.
$$n \cdot u = \frac{1}{|| (1,-u_1/u_2,0) ||}(1,-u_1/u_2,0) \cdot (u_1, u_2,
u_3) = \frac{1}{|| (1,-u_1/u_2,0) ||} (-u_1 + u_1) = 0.$$
Now we take the vector u and rotate it around n using Rodrigue's formula. Rodrigue's
formula can be used if n is a unit vector, which is the reason it was normalized. In the matrix
notation of Rodrigue's formula, a rotated vector can be calculated as urot,1 =
R1u. R1 is called the rotation matrix and is defined as
$$R_1 = I + \sin(\theta)K_1 +
(1-\cos(\theta))K_1K_1 $$ [11]
θ is generated from the normal distribution N(0, L/Nlp), and
When we have urot,1, we rotate the vector around u with an angle uniformly distributed on the interval (-π,π), as the direction the segment is bent towards should be completely random. For this we normalize u and use the matrix notation of Rodrigue's formula once more. Let $$u' = \frac{1}{|| u ||}u.$$ We then get the rotation of urot,1 as urot,2 = R2urot,1, where $$R_2 = I + \sin(\theta)K_2 + (1-\cos(\theta))K_2 K_2,$$
The programs used for modeling make a large number of simulations. The statistical method for our modeling
falls under the category bootstrapping. Bootstrapping is any test or metric using sampling with replacement.
This is, as in our case, often done by using the same sampling method for generating a high number of samples.
Our samples are used to create histograms that are then scaled to get estimations of probability densities.
The following information have been used to make our modeling more accurate. When modeling the TEV completion
system, how much DNA rotates per base pair was taken into account. When DNA is packed in chromatin, which it is
in eukaryotic cells, there are 10.5 base pairs per turn [12]. That means it rotates with
2π/10.5 ≈ 0.6 radians per base pair.
In the cell-free part of the project two dCas9 molecules bind to DNA. Theoretically the system would work better if the dCas9 molecules are positioned on the same side of the DNA. In other words it would be optimal if the DNA rotates an even number of turns, which is accomplished if the number of base pairs in the DNA segment between the dCas9 molecules are
k • 10.5, where k = 1,2,3… .For our project we chose to have DNA segments with approximately
30 base pairs. For computing our modeling results we therefore chose to optimize linker length for a length of
30 base pairs. It might have been better to have 31 or 32 base pairs, as 3 • 10.5 = 31.5 base pairs
corresponds to an even number of turns.
Another important detail is the distance between proteins in contact. When two proteins interact it is called a protein-protein interaction, PPI. A configuration of two interacting proteins has a contact zone. The contact zone consists of the set of amino acids from the proteins that are within interaction distance. The interaction distance is usually spanning from 3-5
Å [12]. For our modeling we have, based on that information, assumed that the TEV halves are in contact
if their surfaces are within 5 Å.
There are some assumptions regarding any WLC modeling. The WLC model assumes that a modeled polymer is
inextensible and has a linear elastic bending energy. Linear elastic bending energy means that the energy
required for bending the polymer is proportional to how much the polymer is bent. It is also assumed that the
polymer is subjected to thermal fluctuations [5]. The polymer is also modeled without volume, which means that a
simulated polymer is modeled as if it could overlap itself. It is therefore assumed that this does not affect
accuracy significantly. Some suggest that accuracy of WLC modeling is worse for a long and flexible polymer, as
it would then overlap itself more [8].
For our modeling we have assumed that there is no force on any polymer. We have also assumed that a polymer can be modeled with good precision if the segment lengths are sufficiently small. As described in the theory section, the model assumes that the angles between chain segments can be generated from a Normal distribution. As WLC modeling assumes that polymers can overlap themselves, it is similarly assumed that proteins can be inside of one another, and polymers can overlap with proteins and DNA. In the TEV contact program for example, the spheres representing TEV halves can overlap.
In the TEV contact program, the two linkers are modeled with equal length. It is assumed that the highest contact probability of TEV halves will be achieved with equal linker lengths.
Other assumptions are that modeling proteins as spheres will not affect accuracy significantly, and that high contact probability will result in a high reaction rate even if protein's relative orientations are not considered.
There are some major advantages with our modeling method compared to other existing methods. In WLC modeling
you normally do not model an entire path of a polymer. One common use of the WLC model is to calculate mean
squared end-to-end distance mathematically [5]. Our WLC modeling on the other hand enables visualizing the
results. Another important advantage of our modeling is that it enables future development where impossible
paths are removed. You can exclude paths that are not physically possible because it goes through for example a
protein or a membrane surface. This might also make it possible to model polymers with thickness, as we could
ensure that the polymer does not overlap itself. Our modeling also makes it possible to get direction of a chain
at its endpoints. In a future extension of our TEV contact program this could be used for calculating the
probability that the TEV halves come into contact with correct relative orientations for the TEV halves to form
a functional TEV protease.
Another extension of our TEV contact program would be to enable testing different linker lengths for each TEV half. It would then not be necessary to assume that equal linker lengths will result in optimal contact probability.
[1]: Gross P, Laurens N, Oddershede L, Bockelmann U, Peterman E, Wuite G. Quantifying how DNA stretches, melts
and changes twist under tension. Nature Physics. 2011;7(9):731-736.
[2]: Nam H, Hwang BJ, Choi D-Y, Shin S, Choi M. Tobacco etch virus (TEV) protease with multiple mutations to improve solubility and reduce self-cleavage exhibits enhanced enzymatic activity [Internet]. FEBS open bio. John Wiley and Sons Inc.; 2020 [cited 2022Oct12]. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7137792/
[3]: Fink, T., Lonzarić, J., Praznik, A. et al. Design of fast proteolysis-based signaling and logic circuits in mammalian cells. Nat Chem Biol 15, 115–122 (2019). https://doi.org/10.1038/s41589-018-0181-6
[4]: Wang, S., Zhang, F., Mei, M. et al. A split protease-E. coli ClpXP system quantifies protein–protein interactions in Escherichia coli cells. Commun Biol 4, 841 (2021). https://doi.org/10.1038/s42003-021-02374-w
[5]: Milstein JN, Meiners J-C. Worm-like Chain (WLC) model. Encyclopedia of Biophysics. 2013;:2757–60.
[6]: M U. Contour length of amino acid - Generic - BNID 114332 [Internet]. Bionumbers.hms.harvard.edu. 2022 [cited 11 October 2022]. Available from: https://bionumbers.hms.harvard.edu/bionumber.aspx?s=n&v=5&id=114332
[7]: Ainavarapu S, Brujić J, Huang H, Wiita A, Lu H, Li L et al. Contour Length and Refolding Rate of a Small Protein Controlled by Engineered Disulfide Bonds. Biophysical Journal. 2007;92(1):225-233.
[8]: Lapidus L, Steinbach P, Eaton W, Szabo A, Hofrichter J. Effects of Chain Stiffness on the Dynamics of Loop Formation in Polypeptides. Appendix: Testing a 1-Dimensional Diffusion Model for Peptide Dynamics. The Journal of Physical Chemistry B. 2002;106(44):11628-11640.
[9]: Marantan A, Mahadevan L. Mechanics and statistics of the worm-like Chain. American Journal of Physics. 2018;86(2):86–94.
[10]: Evers T, van Dongen E, Faesen A, Meijer E, Merkx M. Quantitative Understanding of the Energy Transfer between Fluorescent Proteins Connected via Flexible Peptide Linkers. Biochemistry. 2006;45(44):13183-13192.
[11]: En.wikipedia.org. 2022. Rodrigues' rotation formula - Wikipedia. [online] Available at:
[12]: Levitt M. How many base-pairs per turn does DNA have in solution and in chromatin? Some theoretical calculations. Proceedings of the National Academy of Sciences. 1978;75(2):640-644.
[13]: Furmanová K, Byška J, Gröller E, Viola I, Paleček J, Kozlíková B. COZOID: contact zone identifier for visual analysis of protein-protein interactions. BMC Bioinformatics. 2018;19(1).