PROS by the Stony Brook University 2022 iGEM Team

Modeling

Introduction to Mathematical Modeling of Biological Systems

Mathematical biology uses theoretical analysis, mathematical models, and abstractions of organisms in order to investigate principles which influence the structure, development, and behavior of biological systems. Using programs like MATLAB, researchers can create mathematical representations of complex systems, and describe and analyze those systems quantitatively, allowing for precise simulations and predictions. Mathematical models have been used in a variety of ways: to validate hypotheses from experimental data, to test experimental predictions, and even to make observations about biological systems that are not always evident through experimental practices and procedures. Using mathematical biology, it is feasible to generate a system of equations which models the activity of a system within a cell. Dry lab procedures using these models can allow researchers to assess their proposed systems and experiments, and predict the outcomes prior to the commitment of physical, financial, and temporal resources in the wet lab (Tomlin and Axelrod 2007).

We used a variety of different mathematical models to better understand the limitations of our project and adjust our experiments to account for these limitations.


Models Describing Constitutive Gene Expression

Many biological projects require models that can accurately represent and predict multicomponent, temporally evolving, dynamic systems. Differential equation models are often used for this purpose. This method models the interaction of molecules in the form of rate equations. The system is represented by a system of ordinary differential equations (ODEs), which quantify the interaction between different molecules (i.e. DNA, mRNA or protein), by using the law of mass action. These equations include terms relating to the binding of transcription factors and RNA polymerase to DNA, interactions between transcription factors, mRNA translation rate, and mRNA and protein degradation rates (Ay and Arnosti 2011), for example. In order to accomplish this, knowledge about the system components and structure is required.

However there are a number of limitations and assumptions that need to be considered when using ODEs, outlined below:

  1. The reactions must be parameterizable. In other words, they need to be written as reaction equations with a known rate and known initial concentrations. Often these parameters are hard to find and measure so they need to be estimated.
  2. The reactions need to be “well-stirred”. This means that every component has equal opportunity to interact with any other component (think like a uniform distribution).
    • a. For example, one can use ODEs to model predator-prey relationships (how populations of wolves and rabbits change). If the system is “well-mixed”, then the rabbits and wolves are equally spread around the field. However, of all the rabbits were stuck in a corner, this would not be a good use of ODEs.
    • b. There are two ways to get around this constraint. First, you can either model each compartment as its own ODE, and then have reactions where a few molecules can move between compartments. Second, you can model movement explicitly with PDEs.
  3. The reactants need to be on a continuum scale. In biology terms that usually means dealing with concentrations rather than copy numbers. For example, it is perfectly reasonable to assume 0.5 M of a protein, but it makes no sense to talk about 0.5 proteins individually. This is usually a problem if you are looking at something that is very dilute. To get around this constraint, stochastic simulations are usually done.

Our Approach

Our project aims to synthesize protein S, a protein found in the human bloodstream, in SF9 and E. coli cells. We aimed to create a model that would help us compare the two systems and predict which cell line would be the most effective in producing a larger quantity of protein S; this means that we wanted to model gene (PROS1) expression and protein production. Therefore, for the dry lab portion of our project, we used an ODE model that prescribes to the central dogma of molecular biology (Li and Xie 2011), shown in our diagram below:

Figure 1. The central dogma of biology leads to a simplified version of constitutive gene expression.

Based on this schematic, we wrote a system of reactions that shows the production and degradation of DNA, mRNA, and protein. These reactions are as follows:

(1) DNAm → RNA+DNA (k1)

(2) mRNA → Protein+mRNA (k2)

(3) mRNA → Ø (k3)

(4) Protein → Ø (k4)

System of ODEs

These reactions were then translated into a system of ODEs using the law of mass action. The law of mass action states that the rate of a chemical reaction is directly proportional to the product of the activities or concentrations of the reactants (Aronson et al. 2016). When writing a system of ODEs, there is one equation for each species (in our case, DNA, mRNA, and protein). The rate of change of concentration of each species is proportional to (1) the stoichiometric coefficient times the reaction rate and (2) the product of the concentrations of the reactants. Based on these criteria, the ODE system we constructed is as follows, with 1:1 stoichiometry, where D = DNA, M = mRNA and P = protein:

d[D]/dt=0

d[M]/dt = k1[D] - k3[M]

d[P]/dt= k2[M] - k4[P]

It is important to note that the rate of change of DNA is zero, which means that DNA concentration remains constant in our model. Constant.

Parameter Definition

We conducted a literature search in order to find the rate constants for each equation. The rates that we needed to find were the rates of transcription, translation, mRNA degradation, and protein degradation. For our gene of interest, PROS1, the production of mRNA in E.coli was driven by T7 polymerase, which binds to the promoter PT7 and by a baculovirus-dependent RNA polymerase, which binds to the polyhedrin promoter in SF9 cells. We used this system to model gene expression for both SF9 and E.coli cells. SF9 cells are insect cells from the organism Spodoptera frugiperda. We had some difficulty finding established rates for this cell line, and decided to use the rates of Drosophila melanogaster, a similar insect, as a substitute. k1 is the rate of transcription, k2 is the rate of translation, k3 is the rate of mRNA degradation, and k4 is the rate of protein degradation. The parameter values which we used are as follows:

 
SF9 Spodoptera frugiperda Escherichia coli
k1 = 0.06613 mM/s (Jackson et al. 2000) k1 = 0.038532 mM/s (Garcia et al. 1995)
k2 = 5 aa/s = 7.6790383e-12/s (Olofsson et al. 1987) k2 = 8 aa/s = 3.0243597e-7/s (Guet et al. 2008)
k3 = half-life of 35 min = 0.00033007008/s (Beadle et al. 2022) k3 = half-life of 4 min = 0.00288811325/s (Bernstein et al. 2004)
k4 = half-life of 10hr = 0.00001925408/s (He et al. 2019) k4 = half-life of 70hr = 0.00000275058/s (Maurizi 1992)

Table 1. Parameter definitions for E.coli and SF9 cells.

In order to convert the rate constants from aa/s to M/s, we did the following calculations:

 
Initial Conditions
Species DNA Concentration (mM) mRNA (mM) Protein (mM)
Value 0.00077042 0 0

Table 2. Initial values of all the species modeled in the simulation.

Steady State Analysis

When examining the behavior of a system, it is useful to look as time approaches infinity, also called the asymptotic behavior of a function. Over the course of a certain time, the system should adopt a steady state behavior, where there is no net change in protein concentration. We can analyze this steady state behavior by setting our ODE’s to zero (Bloch 2011). Steady-state protein expression is determined by four rates: transcription, translation, mRNA degradation, and protein degradation (Hausser et al. 2019). A steady state means that there is no net change of a particular parameter in a biological system. There are stable and unstable steady states. In an unstable steady state, a small disturbance will lead to a large change. In a stable steady state, a small disturbance will cause a small change initially, but the system will ultimately converge back to the steady state (Owen 2012). While the system is in a steady state, the amount of protein produced may remain constant because translation and degradation rates are balanced (Vogel et al. 2012). Given our ODE for mRNA, d[M]/dt = k-γM, with k denoting the rate of transcription and , representing the degradation rate of mRNA, to determine the steady state of M, we can set it equal to 0 → Mss=k/γ. We get the same result when we solve the ODE, getting M(t)=Ce-γt+k/2. When we look at the limit as t approaches infinity (∞), or becomes arbitrarily large, Mss=lim(t→∞M(t)=Ce-γt+k/γ, we can say that Ce-γt is 0. Therefore, Therefore Mss=k/γ. This relationship is shown in the growth curve below:

Figure 2. Concentration vs. time plot.

In order to determine our proper value protein S concentrations, we set our mRNA and our protein differential equations equal to zero, shown below:

d[M]/dt = k1[D] - k3[M] = 0

Mss=k1D/k3

d[P]/dt= k2[M] - k4[P]=0

Pss=k2M/k4=k1k2D/k3k4

With all of our parameters defined for both E.coli and SF9 cells, we were able to input our system of ODE’s into MATLAB to predict our steady state concentrations prior to starting wet lab in order to determine the process that would yield more favorable results. Those plots can be found below:

Analytical and Numerical Analysis

Figure 3. Plot showing DNA, mRNA, and Protein vs. time in E.coli cells.

Figure 4. Plot showing DNA, mRNA, and Protein vs. time.

Figure 5. Plot showing mRNA vs. time in E.coli.

Figure 6. Plot showing protein vs. time in E.coli cells.

Figure 7. Plot showing DNA, mRNA, and vs. time in Sf9 cells.

Figure 8. Plot showing DNA vs. time in Sf9 cells.

Figure 9. Plot showing DNA vs. time in Sf9 cells.

Figure 10. Plot showing protein vs. time in Sf9 cells.

Altering DNA Concentration

After analyzing our plots for protein production in E.coli and SF9 cells, we consulted with our wet lab team and decided to simulate what the outcome would be if we altered the DNA concentration by utilizing a ‘for loop’ in MATLAB. We did this as a stand-in for altering cell concentration. Modeling this increase in protein levels gave us valuable insight as well as practice to begin work on our more sophisticated mathematical model, where we directly alter cell concentrations using a fully stochastic model with varying cell population and cell growth, therefore mimicking a more realistic biological system. A ‘for loop,’ used in this simulation, allowed us to repeat our ODEs and have MATLAB run the code for each stated value. We kept our same system of ODEs that we used to determine our steady state values for protein production, and multiplied our initial DNA concentration of 0.00077042 mM by 1, 2, 50, 100, 200, 500, 1000, 5000. We then extracted the steady states and calculated the mean of the last five points and plotted them on the graph pictured below. We saw a linear increase in protein levels in both E.coli and SF9 cells as we multiplied the initial DNA concentration by our arbitrary values

Results

Figure 11. Plot showing protein level vs. DNA concentration after altering DNA concentration in E.coli cells.

Figure 12. Plot showing protein level vs. DNA concentration after altering DNA concentration in SF9 cells

Analysis

In analyzing these E.coli and SF9 plots, at ~3.8mM of DNA concentration, protein S reached a level nearing ~5.8mM. For SF9, at the same level of ~3.8mM DNA concentration, protein S reached a level of ~3.2x10-4. Based on our modeling of protein expression in E. coli and in SF9, we came to a conclusion about their relative expression rates and volumes: E. coli is more time and cost-effective in producing a protein. We therefore decided to focus our efforts on E. coli in cell expression of protein S as we will need a significant yield of the protein for purification.

E. coli machinery is less compatible with producing human protein than evolutionary younger SF9 cells. Because of this, we chose to keep working on SF9 for comparison, and continued to model their distinct gene regulatory network.


Modeling Gene Regulatory Networks

We aimed to model the specific gene regulatory networks for SF9 and E.coli. Both these models are outlined in the sections below. Modeling constitutive gene expression assumes that the genetic circuit is always being expressed at a basal level regardless of the cellular environment and is under no regulation by the cell. Constitutive genes are among the most important elements of a cellular genome, which code for proteins and other molecules that are extremely necessary to maintain a cell's basic metabolism and survival. However, many genes are only needed occasionally and their expression is regulated by the cell's microenvironment. Initial states of regulatory gene circuits can be either ‘on’ or ‘off’ and can be regulated by repressor or activator molecules accordingly. Therefore, we wanted to model both constitutive and regulatory gene circuits for SF9 and E. coli, and compare basal vs regulated protein production values.


Modeling the SF9 Gene Regulatory Network

Overview

The majority of recombinant protein expression is driven by the most powerful baculovirus promoter, polyhedrin, which is active in the late and very late stages of infection. Several studies have been done to analyze the mechanism of the polyhedrin promoter to better characterize its transcription activity. A host secreted transcription factor, polyhedrin promoter binding protein (PPBP) is known to bind a specific sequence on the polyhedrin promoter with extremely high affinity and specificity and plays a major role in the level of transcription through the polyhedrin promoter. It is established that the PPBP binds to the minor groove of DNA, interacting with the polyhedrin promoter sequence and forming a complex. Further studying the PPBP-DNA interactions and manipulating the concentration of PPBP in host SF9 cells can have a significant impact on the yield of recombinant proteins.

Here, we modeled the interaction between the PPBP and the polyhedrin promoter and its effect on the rate of production of mRNA and the resulting protein S. We were unable to find an established literature value for the concentration/number of molecules of the PPBP in SF9 cells, therefore, we estimated the concentration of PPBP based on some commonly found transcription factors in Drosophila Melanogaster.

System of ODE’s and Deriving the Hill Function

A Hill function determines the degree of cooperativity of the ligand(s) binding to the enzyme or receptor. There is a Hill function for an activator and for a repressor, called h+, and h-, respectively. We used the activator hill function to model the regulatory gene circuit in SF9 cells.

The ODEs were as follows:

Parameter Estimation

For the parameters, we assumed the value to be the maximum transcription rate k1. The Michaelis Menten constant, K, takes into account the energetics of binding between two particular proteins. We were unable to find the Michaelis Menten constant for the PPBP-polyhedrin promoter interaction, therefore, we used the dissociation constant of the PPBP-Polyhedrin complex found in literature. For the Hill coefficient, unless another value is found in literature, you assume a value of 2 for transcription factors.

Parameter Parameter Name
α Maximum expression of the promoter/transcription factor complex
K Dissociation Constant
x Concentration of Transcription Factor
n Hill Coefficient

Table 3. Parameter Definitions.

Steady State Assumption and Initial Concentrations

For the initial concentration of the promoter, we found the literature value for the maximum number of virions per SF9 cell, and one promoter molecule per virion. The number of Polyhedrin promoter molecules is 55 (Zhang et al. 2022) and the initial concentration is 2.2832 x 10-8 mM. The volume of a typical mammalian cell is 4000 um3 / 4 x 10-12 L, and this value was used for the unit conversions, shown below (Luby-Phelps 2000).

For the initial steady-state concentration of the PPBP transcription factor, we assumed the literature value for the endogenous transcription factors Eve and Ftz found in Drosophila melanogaster. The number of molecules of PPBP is 50,000 molecules (Walter et al. 1994) and the initial concentration is 2.0755 x 10-5 mM.

Rate Constants

Rate constants Value
k1(Transcription Rate) 0.06613 mM/s (Jackson et al. 2000)
k2(Translation rate) 5 aa/s = 7.6790383e-12/s (Olofsson et al. 1987)
k3(mRNA degradation rate) 0.00033007008/s (Beadle et al. 2022)
k4(protein degradation rate) 0.00001925408/s (He et al. 2019)
K (Dissociation Constant (Promoter TF) 3.7 x 10-12 M (S Burma 1994)

Table 4. Rate constants for PPBP and the polyhedrin promoter interactions.

 
Initial Conditions
Species PPBP (mM) Polyhedrin (mM) mRNA protein
Value 2.0755 x 10-5 2.2832 x 10-8 0 0

Table 5. Initial values of all the species modeled in the simulation.

Results

Figure 13. mRNA vs protein S concentrations over time.

Figure 13 plots protein S concentration as a function of mRNA concentration and time (P = f(M,t), describing how the protein concentration evolves as a result of different initial mRNA concentrations over time.

Figure 14. mRNA concentration over time. This graph shows mRNA levels approaching its steady state value.

Figure 15. Protein S concentration over time. This graph shows protein S levels approaching its steady state concentration.

The dynamics of mRNA production and degradation are much faster than that of the protein, therefore, mRNA levels reach their steady state concentrations much faster than the proteins do.

Summary

Modeling the SF9 regulatory gene circuit vs the SF9 constitutive gene circuit provides insight into the role of the polyhedrin promoter binding protein (PPBP) and its effect on driving robust expression through the polyhedrin promoter. This is evident by the difference in the final steady-state protein values plotted by the regulatory vs the constitutive model. However, the exact dynamics of PPBP and polyhedrin promoter interactions are not well understood, so it was difficult for us to find and approximate the literature values for initial steady state concentrations. Future models and experiments would seek to gain a better understanding of these values in order to make our simulation more accurate and improve our model.

Figure 16. Comparing final protein production amounts for SF9 constitutive vs regulatory gene circuits. (plot on the left shows regulatory expression values and plot on the right shows constitutive expression values)


Modeling the E.Coli Lac Operon Inducible Gene Network

Overview

Here, we modeled our T7 promoter regulatory gene circuit which employs transcriptional control by a repressor protein. The lac operon circuit consists of an operator site in the promoter region, to which the lacI repressor protein can bind and establish transcriptional repression. For initiating transcription, an inducer must bind to the lac repressor protein (lacI) which prevents the repressor from binding to the operator. The T7 promoter is induced using IPTG, which mimics allolactose and binds the lac repressor protein, allowing for the T7 RNA polymerase to bind to the promoter and induce transcription of the downstream gene sequence.

System of ODE’s and Deriving the Hill Function

The T7 lac operon is modeled using a repressor hill function, which takes into account the interactions of IPTG and lacI repressor protein.

The ODEs were as follows:

Parameter Estimation

Parameter Parameter Name
α Maximum expression of the promoter/transcription factor complex
K Michaelis Menten Constant
x Concentration of Transcription Factor
n Hill Coefficient

Table 6. Parameter Definitions

For the parameters, we assumed the value to be the maximum transcription rate, k1. The Michaelis Menten constant, K, takes into account the energetics of binding between two particular proteins. We obtained the Michaelis-Menten constant from a literature source. For the Hill coefficient, unless another value is found in literature, you assume a value of 2 for transcription factors. kon and koff are the association and the dissociation rate constants for the formation of IPTG/lacI complex. We set the initial concentration of IPTG to be 0.5 mM in order to determine a suitable concentration to optimize expression.

Rate Constants

Rate constants Value
k1 (Transcription Rate) 0.038532 mM/s (Garcia et al. 1995)
k2 (Translation rate) 8 aa/s = 3.0243597e-7/s (Guet et al. 2008)
k3 (mRNA degradation rate) 0.00288811325/s (Bernstein et al. 2004)
k4 (protein degradation rate) 0.00000275058/s (Maurizi 1992)
kon (association rate constant) 0.011 mM-1s-1(Xu and Matthews 2009)
koff (dissociation rate constant) 0.3 s-1 (Xu and Matthews 2009)
K (Michaelis-Menten constant) 0.001 mM (Semsey et al. 2013)

Table 7. Rate constants for T7 lac operon interactions.

 
Initial Conditions
Species lacI (mM) mRNA (mM) [lacI][IPTG] (mM) [IPTG] (mM)
Value 0.00077042 0 0 0.5

Table 8. Initial values of all the species modeled in the simulation.

Results

Figure 17. mRNA vs protein S concentration over time. This graph demonstrates an increase in the levels of mRNA and the corresponding protein levels post-induction using IPTG.

Figure 17 plots protein S concentration as a function of mRNA concentration and time (P = f(M,t), describing how the protein concentration evolves as a result of different initial mRNA concentrations over time.

Figure 18. mRNA production over time, this graph demonstrates the production of mRNA after IPTG induction and the time it takes to obtain steady state levels of mRNA.

Figure 19. Protein S production over time, this graph demonstrates the production of protein S from our gene circuit after inducing with 0.5 mM of IPTG and time taken to reach the steady state concentration.

Limitations of the model

Modeling the E. coli lac operon regulatory gene circuit vs the constitutive gene circuit provides insight into the role of IPTG induction and its interaction with the repressor protein to induce expression from the T7 promoter. However, the protein production values from the regulatory model simulations are much higher (~ 4 mM) than the values from the constitutive model simulations (1.1 10-3mM). We suspect that some errors in our parameter estimations might be causing the anomaly in the protein production values.

Conclusion

The graphs generated from our MATLAB simulation state that our gene circuits work well and protein S can be successfully expressed using the vector in E coli. We used this data obtained by mathematical modeling of the circuit kinetics to get an understanding of how our gene circuit works and used it to design our wet lab experiments. Based on this dry lab data, we used 0.5 mM IPTG for inducing E coli origami B cells and obtained successful expression of protein S. Our hope is that other iGEM teams can use our model to better characterize their gene expression circuits which would help them design their wet lab experiments. Moreover, we hope that future iGEM teams can build upon our SF9 regulatory model using more accurate parameter estimations and test the circuits using wet lab experiments.


Stochastic Modeling

A fully stochastic model can also be used to project the probability distributions of the potential results by allowing for random variation in at least one of the inputs over the course of a set time scale (Gillepsie 1977). These variations resemble a more realistic biological system because it accounts for noise and non-genetic heterogeneity. Using this simulation, we are able to estimate the impact of cell concentration on protein production using the initial conditions of D0,M0,P0,k1,k212. Our ODEs for this model were not changed, but we did need to slightly reformat them to fit. We set our reactions as shown below:

We decided to work on our model from the inside out, meaning that we created the while loop in MATLAB that would simulate the molecular level of our system, then formatted the for loop, which would update every cell by calculating the various protein levels based on the Gillespie Algorithm which would be written into our final while loop which encloses our whole system. The output of the stochastic simulation is a matrix with many rows that are constantly updating the cells DNA, mRNA, and protein levels. We began by initializing our cells. We decided to set three different normal distributions for DNA, mRNA, and protein, with two parameters for each distribution: the mean and standard deviation.

We set the following initializations in MATLAB:

Variable Initial Value Description
n cells 100 Initial amount of cells
DNAvals normrnd(μ,σ,n cells;1) Generates a random number distribution with mean, standard deviation, and cells for DNA values
mRNAvals normrnd(μ,σ,n cells;1) Generates a random number distribution with mean, standard deviation, and cells for mRNA values
proteinvals normrnd(μ,σ,n cells;1) Generates a random number distribution with mean, standard deviation, and cells for protein values
CS = [DNAvals, mRNAvals, proteinvals] 1 Cell State
P_growth k/k+γ(N-1) Growth Propensity
P_death γ(N-1)/k+γ(N-1) Death Propensity
t(1) 0 Time
simtime 43200s Simulation Runtime
ctr 1 Iteration
max ctr 7000 Max Number of Iterations

Table 9. Initializations in MATLAB.

Our initial cell count was set to 100, and we created a matrix with which a value at random would be chosen from the distributions for our initial DNA, mRNA, and protein concentrations. ‘Ctr’ is the number of iterations and we decided to set out ‘max ctr’ to 7000, so that the simulation time would be reasonable. We used the Gillespie algorithm which uses probability to accurately predict the trajectory of a stochastic model (Gillepsie 1977).

Once we wrote the code for our while loop, we used a for loop to update the cells, each cell is updated once and the number of cells that are being updated is changing with each time step.. Every cell in the current population updates its DNA, mRNA, and protein levels based on the Gillespie algorithm. Following our for loop, we chose a value at random from a uniform distribution and if that value was less than or equal to the growth propensity, which we calculated to be 4.66510-4/4.66510-4+4.8611110-6(n-1) , then one reaction would occur and the cell state would update. If that was not the case, a different reaction would occur and the cell state would be updated.

An important component of our model was determining what the time-step would be. Time dependent models require a time-step to accurately predict a system outcome. A time-step relies on many factors and it is usually more beneficial to use a smaller timestep to limit unstable assumptions and solutions in a system. One caveat to this is that shorter time-steps elicit longer simulation times which requires greater computational power (Hockley 2021). We calculated our time step to be, tau=[1/(Pgrowth+Pdeath]x log[1/(1-u)], with our death propensity being [5.1841910-15 x (n-1)/4.66510-4+5.1841910-15(n-1)], which is derived from the poisson distribution. We then multiplied this value by 8000 to speed up run time. For this reason, our model is not as accurate as it could be; we were limited by computational power. In the future, running this simulation with an accurate timestep would increase the accuracy of our results, better modeling cell count, DNA, mRNA, and protein amounts.

Results

From our model, we wrote the script for a histogram of DNA, mRNA, and protein concentration at various times, where the x-axis represents the number of molecules and the y axis represents the cell count, displaying how many cells at a given time are producing/contain a certain amount of each species. We decided that capturing our histogram as a movie frame would be the best display of our results over time.

Figure 20. Histogram results for stochastic model


Scaling Up Production

Overview

Using mathematical modeling, we also aimed to predict the inherent variability in baculovirus growth and viability which would be useful in determining the optimal operating conditions for fermenters at an industrial scale, in order to increase production of protein S. We were unable to continue protein expression in SF9 cells due to time constraints, but wanted to determine if this system would be achievable for industrial production. Baculoviruses are known to specifically infect invertebrates and contain a circular, double-stranded genome which ranges in length from 80 to 180 kbp (Hoffman et al. 1995). The industrial large-scale production of baculovirus can have a significant impact, with wide-ranging economical implications such as the production of vaccines, biopesticides, or, as in our case, recombinant proteins (Demain and Vaishnav 2009).

In order to accomplish this, we used an integrated model for the baculovirus infection process which simulates the steps of DNA replication, mRNA and protein expression, and polyhedra production. This model was adopted from a dissertation submitted to the Indian Institute of Technology Hyderabad: “Mathematical modelling of baculovirus infection process: Kinetic parameter estimation,” by Suraj Kumar Singh (Singh and Giri 2018).

The model integrates the various dynamics of the infected cells, uninfected cells, standard virus, defective interfering particles (DIPs) and DNA, RNA, protein expression and polyhedral production. In particular, this model accounts for the kinetics of DNA, RNA, proteins and DIP formation. The specific model variables are cell mass, oxygen, carbon dioxide, DNA, RNA, proteins, baculovirus, and DIPs as products.

In this model, there are underlying basic assumptions for the infection process (Singh and Giri 2018):

  1. The budded virus is internalized in the host cell, and it releases its viral genome into that cell
  2. The viral DNA is transported to the surface of the nucleus from the cytosol
  3. The DNA is then replicated and copies are made by taking advantage of the host machinery
  4. RNA is formed in order to produce DNA
  5. Early, late, and very late proteins: FP25K and GP64 are expressed by the RNA
  6. Nucleocapsids are formed, and occlusion-derived virus (ODV) formation occurs
  7. Polyhedrin proteins are transferred to the nucleus by the FP25Kprotein, resulting in the formation of the nuclear polyhedrosis virus (NPV)
  8. The cells explode and the viruses emerge
  9. Viruses are budded along with the surface protein: GP64
  10. Occluded virused are formed through envelopment in ODVs and embedded in polyhedra

Defective interfering particles (DIPs) are also an important part of this system. DIPs are spontaneous deletion mutants of viruses. They can have complex effects on the growth of viruses, thus they were included in the model (Kirkwood et al. 1994).

The infection process described above was represented by a set of coupled ordinary differential equations initial value problems (ODE-IVPs). There were various kinetic parameters which included specific growth rate, death rate, rate of oxygen consumption, the rate of carbon dioxide formation, the maintenance coefficient, product degradation rate, rate of DNA replication, rate of RNA formation, rate of protein formation, etc (Singh and Giri 2018).

This model has nine essential components.

  1. Host cells (SF9)
  2. The standard, wild-type virus
  3. Defective Interfering Particles (DIP’s)
  4. Oxygen
  5. Carbon Dioxide
  6. Substrate
  7. DNA
  8. mRNA
  9. Dead Cells

The number of free DIP and standard virus particles are VD and VS, respectively. The number of uninfected cells, oxygen concentration, carbon dioxide concentration, and substrate concentration, were denoted by CU, 𝑂2, 𝐶𝑂2 and 𝑆 respectively. The initial state was specified by initial values of CU, VD, VS, 𝑂2, 𝐶𝑂2, and 𝑆. The number of uninfected cells infected by DIP viruses are denoted by CD. The number of cells infected by the standard virus are denoted as CS. Finally, the number of cells infected by both standard and DIP viruses are denoted by CB.

The basic underlying assumptions were (Singh and Giri 2018):

  1. Infection of the cell occurs through random contact between virus particles and cells.
  2. The infection rate per cell-virion pair is α, and is constant.
  3. CD cells are formed by infection of CU cells with DIP virus.
  4. CS cells are formed by infection of CU cells with standard (wild-type) viruses.
  5. CB cells are formed by the infection of the DIP virus with a CS cell or the infection of a standard virus with a CD cell, or through co-infection of both the viruses with uninfected cells.
  6. Cell growth is dependent on substrate, oxygen and carbon dioxide concentrations
  7. The CS cell will burst out at the end of the virus generation time; this time is not fixed but we are assuming it is distributed around an appropriate mean generation time
  8. The production of CB cells from a CS cell occurs when a CS cell is co-infected with DIP. But, this conversion of a CS cell to a CB cell occurs if the CS cell is co-infected with a DIP before a certain cutoff time. After that cutoff time, the CS cell will burst, releasing the standard virus.
  9. Conversion from a CD cell to a CB cell occurs when a CD cell gets infected with a standard virus. The CB cell bursts after a virus generation time, resulting in the production of DIPs and the standard virus.

Equations

The equations for our model are as follows:

The first term in equation 1 represents the amount of substrate. The first term in equation 3 represents cell growth, the second term represents cell loss through infection, and the third term represents cell death. In equation 4, the first term also represents cell growth, the second term represents cell gain through infection, the third term represents cell loss through coinfection, and the last term represents cell death.

The age of the cells is defined for the CS and CB cells. For this, the number of subclasses of CS and CB were defined where 𝜏 is a constant representing the duration of a subclass. The maximum time that can elapse before a CS or CB cell bursts and releases the virus was set to be T. The number of subclasses for the CS and CB cells were divided as n = T/𝜏. In the following equations, 𝐶𝑆𝑖 and 𝐶𝐵𝑖 denote the 𝑖𝑡ℎ subclass. The equations were written as follows:

The first term in equation 5 represents cell growth, the second term represents cells gain through infection, the third term represents the production of CS cells based on polyhedrin protein, the fourth term is the rate of transfer of cells from 𝐶𝑆1 to 𝐶𝑆2 through the aging process, the fifth term represents the rate of cell loss, estimating the cells undergoing virus release, and the last term is the rate of cell apoptosis or death. The variable 𝑝𝑖 (𝑖 = 1, 2 …, 𝑛) represents the death rates of cells in the 𝑖𝑡ℎ subclass which are undergoing virus release. The quantity, 𝑧𝑖, indicates the possibility of coinfection with DIPs. We assumed that the conversion is possible (𝑧𝑖 = 1) before a certain time, 𝑖 = 𝑖. Until this time, standard virus generation takes place, and after this time, conversion becomes impossible (𝑧𝑖 = 0).

The equations for the modeling of the CB cells are very similar to those for the CS cells. However, the production of CB cells is possible in two different ways: by infecting CD cells with standard virus or infecting CS cells with DIPs. These equations are as follows:

Ʃ𝐶𝑆𝑖 is the summation of all the subclasses of the CS cells. The rate of formation of the DIPs and the standard viruses were obtained by adding the rates of virus release from each of the subclasses of CS and CB, and then subtracting the rates of virus loss throughout the infection process (Singh and Giri 2018).

Usually, the production of DNA takes place from the standard virus in a very specific time period: 𝑓𝐷𝑁𝐴. For our model, we assumed that DNA production does not take place before or after that time interval at all.

𝑖𝑓(𝑡<𝛿DNAmin) fDNA =0
𝑖𝑓(𝛿DNAmin<𝑡<𝛿DNAmax) 𝑓𝐷𝑁𝐴=1−𝑡−𝛿DNAmin/𝛿DNAmax - 𝛿DNAmin
𝑖𝑓(𝛿DNAmax<𝑡) fDNA=0

Table 10. If statements for DNA production.

The first term in equation 13 represents the rate of DNA production from standard viruses, the second term represents the rate of replication of DNA, and the last term represents the rate of degradation of DNA.

Production of mRNA was modeled considering that there are three different types of RNA: early RNA, late RNA, and very late RNA. The production of early RNA starts before DNA replication, therefore the production of early RNA is mainly dependent on viruses In contrast, the production of late and very late RNA is mainly dependent on the amount of DNA that is present in the nucleus. Time varying functions were included to account for RNA production during specific phases. The equations are as follows:

𝑖𝑓(𝑡<𝛿RNAemin) 𝑓early=0
𝑖𝑓(𝛿RNAemin<𝑡<𝛿RNAemax) 𝑓early=1−𝑡 − 𝛿RNAemin/𝛿RNAemax- 𝛿RNAemin
𝑖𝑓(𝛿RNAemax<𝑡) 𝑓early=0

Table 11. If statements for early RNA production.

𝑖𝑓(𝑡<𝛿RNAlmin) 𝑓late=0
𝑖𝑓(𝛿RNAlmin<𝑡<𝛿RNAlmax) 𝑓late=1−𝑡 − 𝛿RNAlmin/𝛿RNAlmax- 𝛿RNAlmin
𝑖𝑓(15<𝛿RNAlmax) 𝑓late=0

Table 12. If statements for late RNA production.

𝑖𝑓(𝑡<𝛿RNAvlmin) 𝑓verylate=0
𝑖𝑓(𝛿RNAvlmin<𝑡<𝛿RNAvlmax) 𝑓verylate=1−𝑡 − 𝛿RNAvlmin/𝛿RNAvlmax- 𝛿RNAvlmin
𝑖𝑓(𝛿RNAvlmax<𝑡) 𝑓verylate=0

Table 13. If statements for very late RNA production.

In equation 14, the first term represents the production of early RNA, and the second term represents the degradation of RNA. The equations for late and very late RNA were constructed similarly.

In equation 17, the first term represents the production of GP64 protein, and the second term represents the degradation of the protein. We constructed the equations for the late and very late proteins, FP25K and polyhedrin, similarly. We assumed that the amount of GP64 protein is inversely proportional to the amount of FP25k protein, and that the amount of polyhedrin is directly proportional to the amount of the FP25k protein.

In equation 20, the first term represents the mass transfer rate of oxygen into the insect cells, and the second term represents the oxygen consumption rate. In equation 22, the first term represents the rate of death of the uninfected cells and the second term represents the rate of death of infected cells.

The distribution of the virus release time for the CS and CB cells is governed by the variable 𝑝. For our model, we assumed that the 𝑝 value will be zero before 18 hours post-infection because no virus release takes place before this time, and after this time, the virus release will follow an exponential function.

p=0 if(t<18)
p= α.eβ(i-i') Otherwise

Table 14. If statements for virus release time.

𝑖′ is the minimum number of subclasses through which a CS or CB cell must pass before it is first exposed to the risk of cell lysis. The rate of virus release are controlled by the values of 𝛼, 𝛽 and 𝑖′. Also, this model assumes that co-infection does not take place before a certain time.

zi=1 if(t<10)
zi=0 Otherwise

Table 15. If statements for rate of virus release.

Sensitivity Analysis

Sensitivity analysis is a tool that is used to identify the specific parameters that affect the response variables. This technique is useful for finding out the parameter range in which the parameter estimation is to be conducted. We did not complete the sensitivity analysis, as it had already been done by Singh and Giri in 2018. Through their sensitivity analysis, we could obtain the desired parameter range for choosing the initial values. Their sensitivity analysis was performed with ± 100% change in each of the parameters and observed the effect on two variables: cell density and percentage of viable cells. They created a 3-D representation of their parametric sensitivity analysis for this model. This image is shown below. The X axis shows all the different parameters, the Y axis shows all the model variables, and the Z axis shows the change in model variables with respect to the different parameters. The result demonstrates that cell density is sensitive to specific growth rate, the rate of protein transport, and the rate of protein synthesis. The percent of polyhedra per cell is sensitive to specific growth rate, α, and β. Cell viability is sensitive to specific growth rate and the rate of DNA transcription.

Figure 21. Sensitivity Analysis Completed by Singh and Ghiri. The X axis shows all the parameters in the model, the Y axis shows the variables, and the Z axis shows the percentage variation of the variables with parameters (Singh and Giri 2018).

Parameter Definitions

We ran the simulation three times, with three different sets of initial conditions. Each data set had a different amount of initial uninfected cells. We also collected and analyzed parameters for both wildtype and stabilized viruses. However we only chose to model the stabilized virus, using MATLAB. There were a total of 23 parameters and 75 variables in our model.

Variables Uninfected Cells Substrate O2 CO2 Dead Cells Infected Cells Virus DNA RNA Proteins
Initial Condition (data set 1) 460000 10000 10000 0 5150 0 0 0 0 0
Initial Condition (data set 2) 590000 10000 10000 0 5150 0 0 0 0 0
Initial Condition (data set 3) 650000 10000 10000 0 5150 0 0 0 0 0

Table 16. Initial Conditions— Infected Cell for Wild Type Viral Infection

Variables Uninfected Cells Substrate O2 CO2 Dead Cells Infected Cells Virus DNA RNA Proteins
Initial Condition (data set 1) 460000 10000 10000 0 5150 0 0 0 0 0
Initial Condition (data set 2) 590000 10000 10000 0 5150 0 0 0 0 0
Initial Condition (data set 3) 650000 10000 10000 0 5150 0 0 0 0 0

Table 17. Initial Conditions— Infected Cell for Stabilized Viral Infection.

Parameters Parameter Name WildType Virus DS1 WildType Virus DS2 WildType Virus DS3 Stabilized Virus DS1 Stabilized Virus DS2 Stabilized Virus DS3
a (gL-1) Infection Rate Constant 1.94 x 10-5 1.94 x 10-5 1.94 x 10-5 1.82 x 10-5 1.82 x 10-5 1.82 x 10-5
b (gL-1) Virus Production Rate 21.57 21.57 21.57 23.14 23.14 23.14
Ks(gL-1) Half-Velocity Constant 50.00 50.00 50.00 106.42 106.42 106.42
K(gL-1) Saturation Constant 0.628 0.628 0.628 0.971 0.971 0.971
KO2 (gL-1) Half-Velocity Constant 430.379 430.379 430.379 493.601 493.601 493.601
KCO2 (gL-1) Half-Velocity Constant 14.913 14.913 14.913 14.702 14.702 14.702
Ys(gL-1) Yield of Cell/Substrate 1.2 x 105 1.2 x 105 1.2 x 105 1.12 x 105 1.12 x 105 1.12 x 105
YO2(gL-1) Yield of Cell/Oxygen 8.0 x 105 8.0 x 105 8.0 x 105 8.0 x 105 8.0 x 105 8.0 x 105
kla (gL-1) Mass Transfer Rate Constant 1.0 x 10-5 1.0 x 10-5 1.0 x 10-5 5.94 x 10-5 5.94 x 10-5 5.94 x 10-5
kDNA (gL-1) DNA Replication Constant 1.95 1.95 1.95 3.80 3.80 3.80
kdeg(gL-1) Degradation Rate Constant 0.00728 0.00728 0.00728 0.0132 0.0132 0.0132
O* (gL-1) Equilibrium Oxygen Concentration 2.699 2.699 2.699 1.073 1.073 1.073
kRNA(gL-1) Transcription Rate Constant 5.766 5.766 5.766 3.505 3.505 3.505
Kpoly(gL-1) Equilibrium Constant for fp25k 0.500 0.500 0.500 1.179 1.179 1.179
Kfp(gL-1) Equilibrium Constant for Polyhedrin 0.008 0.008 0.008 0.008 0.008 0.008
kpe(gL-1) Protein Synthesis Rate Constant 0.088 0.088 0.088 0.0768 0.0768 0.0768
KRNA(gL-1) Half-Saturation Constant for Protein 16.05 16.05 16.05 27.37 27.37 27.37
ktrans(gL-1) DNA Transfer Constant 0.00898 0.00898 0.00898 0.0006 0.0006 0.0006
⍺ (---) Virus Release Rate Constant 0.001 0.001 0.001 0.001 0.001 0.001
𝛃 (hour-1) Bursting Rate Constant 0.501 0.501 0.501 0.307 0.307 0.307
μmax(hour-1) Maximum Specific Growth Rate 0.87 1.084 1.088 0.93 1.09 1.013
kd1(hour-1) Intrinsic Death Rate 0.15 0.001 0.0014 0.03 0.0072 0.13
k* (hour-1) Death Rate due to Infection 0.0049 0.0049 0.0049 0.001 0.001 0.001

Table 18. Estimated Parameters.

Parameter (Wild Type Virus) DS1 DS2 DS3 Mean and Standard Deviation
μmax (hour-1) 0.87 1.084 1.088 1.014 ± 0.124
kd1(hour-1) 0.15 0.001 0.0014 0.0508 ± 0.49
Parameter (Stabilized Virus) DS1 DS2 DS3 Mean and Standard Deviation
μmax (hour-1) 0.93 1.09 1.013 1.011 ± 0.0462
kd1(hour-1) 0.03 0.0072 0.13 0.0557 ± 0.03777

Table 19. Parameter Values for Three Different Simulations with Mean and Standard Deviation.

Results

Figure 22. Infected cell density vs. time plot.

Figure 23. Infected cell density vs. time plot.

Figure 24. Infected cell density vs. time plot.

Figure 25. % of polyhedra/cell vs. time plot.

Figure 26. % of polyhedra/cell vs. time plot.

Figure 27. % of polyhedra/cell vs. time plot.

Figure 28. Cell viability vs. time plot.

Figure 29. Cell viability vs. time plot.

Figure 30. Cell viability vs. Time Plot

Simulated results showing infected cell growth for stabilized virus and triplicate data set; red lines show cell density with time; green shows percentage of polyhedra/cell with respect to time; blue represents the % of viable cells with time.

We ran the simulation three times, with three different sets of initial conditions for comparison, and to explore which initial conditions are optimal for protein production. Each data set had a different amount of initial uninfected cells. We also collected and analyzed parameters for both wildtype and stabilized viruses. However we only chose to model the stabilized virus, using MATLAB. There were a total of 23 parameters and 75 variables in our model. The first set (red) of graphs shows the cell density of infected cells over time, the second set (green) shows the % of polyhedra per cell over time, and the last set (blue) shows the cell viability of the infected cells over time.

Analysis

What We Learned From Modeling Cell Density

We simulated three models with different initial conditions for each and obtained the corresponding growth patterns. The following basic assumptions regarding the baculovirus infection process are important to analyze the cell density changes and growth patterns of SF9 cells post infection:

  1. Growth and cell density of SF9 cells are dependent on first order death rate for a certain period and then on the number of viral particles replicated in the cell.
  2. Other factors affecting cell growth patterns include initial cell mass, available substrate concentration, oxygen depletion rate, carbon dioxide formation and number of viral particles produced.

The Baculovirus infection cycle can be divided into the following phases, and the corresponding activities occuring in each phase directly the host cell growth and death patterns:

  1. Early phase: Starting in the host cell nucleus, viral transcription begins around 30 min post infection by exploiting host cell machinery. Most of the viral genes transcribed in this phase are transcriptional factors and regulatory proteins which will play an important role in later infection phases. During this phase, host RNA polymerase is being used for viral replication, however, host gene expression and other functions are still occurring normally. This is evident in our models as the cell density keeps increasing during the early phase of infection which can last up to 6 hours. Cell growth is supported by the available substrate and oxygen availability (Putman, Z. 2020, November 15).
  2. Late Phase: This phase typically occurs between 6 and 15 hours post infection and is marked by a decline of host gene expression. Late viral genes are transcribed using a baculovirus RNA polymerase and encode proteins that are critical for producing budded viral particles and nucleocapsid proteins. While the viral replication increases and host mRNA levels steadily decrease, SF9 cells still undergo normal cell cycle based on availability of resources.
  3. Very Late Phase: Beginning around 18 hours post infection and continuing over 70 hours post infection, this phase marks expression of very late genes which encode for polyhedrin proteins and other products essential for virus maturation and formation of occlusion bodies. During this phase, viral proteins and structures necessary for virus assembly are synthesized and formation of occlusion bodies is initiated. At the end of the cycle, the host cell undergoes cell lysis, releasing the virus particles into their environment. This causes a sharp decline in cell density, as seen in our modeled simulations. The variation in the relationship between hours post infection and the starting decline in cell density may be accounted for by different initial cell densities, multiplicity of infection, culturing conditions, etc.

What We Learned From Modeling % Polyhedra per Infected Cell

One of the most important very late phase promoters, polyhedrin, is involved in the hyper-expression of the protein polyhedrin, a crystalline structure encapsulating the newly formed virions and forming occlusion bodies. This crystalline matrix protein protects the virus from the harsh environment, until they are ingested by insects and the polyhedrin structure is broken down due to the alkaline environment of insects midgut.

While many mechanisms of expression through the polyhedrin promoter are unknown, it is known that the polyhedrin promoter binding protein (PPBP) and the very late factor 1 (VLF-1) play an important regulatory role by binding to the polyhedrin promoter and leading to burst in the expression of very late phase proteins, such as polyhedrin. As a result, large amounts of polyhedrin is produced, which accumulates in the cytoplasm and forms crystals called polyhedra. This promotes the maturation of occluded virions and lysis of the cell. The curve of % polyhedra vs. time plateaus after a certain time post infection, indicating cell lysis and completion of the infection cycle.

What We Learned From Modeling Cell Viability

Cell viability is a measurement for calculating the proportion of healthy cells in a population of cells. Cell viability is an important parameter to consider when working with baculovirus expression systems, because it is directly proportional to the amount of recombinant protein produced.

During a baculovirus infection cycle, the cell viability decreases gradually as the virus takes over host machinery and completely diminishes the host genome expression, eventually leading to host cell lysis, which causes the curve of cell viability vs. time to approach 0. Manipulating and increasing the cell viability during a baculovirus infection system has been an important area of research because of its correlation with recombinant protein production. Researchers have tried to increase cell viability by working with different insect cell lines, manipulating culturing conditions, applying a non-lytic virus infection cycle, expressing foreign gene proteins that increase cell viability, and temperature oscillation during insect cell culture, etc.

Cell viability is significantly increased in cells infected with the vankyrin‐encoding baculoviruses as early as day 2 post‐infection, and at 3 days post infection cell viability is more than twice as high in cells infected with baculovirus harboring the vankyrin gene compared to cells infected with control viruses lacking the vankyrin gene (Steele et al., 2017).

Temperature oscillation can also enhance cell viability of SF9 insect cells and baculovirus production of occlusion bodies (OB) and extracellular virus (ECV) compared with constant temperature in stationary culture and suspension culture (Shao-Hua et al., 1998).

Our modeling helped give insight into some of the optimal operating conditions for cell viability, cell density, and % polyhedra, giving us more insight into how to possibly scale up the infection process to produce more recombinant protein.

Enhancing Recombinant Protein Yield Through Baculovirus Systems

In general, our modeling gave us insight into how an increase in yield of recombinant protein can be achieved by optimizing the infection process by manipulating parameters such as initial cell density, MOI, TOI, time of harvest, cell cycle kinetics and culture states used for infection. Since MOI determines the ratio of infection agents to host cells and is related to cell concentration, it shows a significant correlation with recombinant protein yield. The effect of low and high MOI on protein expression is studied in (Zitzmann et al., 2017) and it states that using a low MOI is more effective for industrial scaling of cell culture and protein expression. A low MOI requires smaller viral stock volumes and is easier to achieve. Moreover, using a low MOI results in a two-stage infection cycle, a budded virus stage (occurs during early infection stages) and an occluded virus stage (late infection stage).

In the budding stage, only a proportion of host cells are infected, in which the virus particles are formed and are exocytosed into the environment through fusion with the host cell membrane. During this stage, the host cells are releasing new virions to infect other insect cells and also producing recombinant protein while normally replicating themselves. This gives a higher cell density at the start of the second stage of infection, occluded virus production. During the occluded virus stage, late and very late genes undergo a rapid burst of expression, producing proteins and structures necessary for the maturation of occluded virions, which begin to accumulate in the cytoplasm and are eventually released when the host cell is lysed. This can be observed in our model.

Majority of the recombinant protein expression is driven by the most powerful baculovirus promoter, polyhedrin, which is active in the late and very late stages of infection. Several studies have been done to study the mechanism of the polyhedrin promoter to better characterize its transcription activity. A host secreted transcription factor, polyhedrin promoter binding protein (PPBP), is known to bind a specific sequence on the polyhedrin promoter with extremely high affinity and specificity and plays a major role in the level of transcription through the polyhedrin promoter. Further studying the PPBP-DNA interactions and manipulating the concentration of PPBP in host SF9 cells can have a significant impact on the yield of recombinant proteins.


Overall Conclusions From our Modeling

Our dry lab team decided from the beginning of our project that the most critical component would be that our data and corresponding analysis were comprehendible by future iGEM teams as well as individuals with no background in mathematical modeling. Having limited experience ourselves, we saw this as an opportunity beyond our project goals to use what we learned, to teach the value of using mathematical modeling in conjunction with wet lab to increase productivity.

It was important for us to establish a basic understanding that would prove to be the foundation for all of the successive mathematical models that we chose to construct. For our first model, we chose to write a system of reactions based on the central dogma of biology for a basic model of constitutive gene expression. Applying the law of mass action, we we wrote the following ordinary differential equations:

For our model, DNA concentration remained constant because the rate of change of DNA is zero. After defining our parameters using a literature search and converting our rate constants to M/s, we were able to examine the steady state behavior by setting our ODE’s to zero. We input our script into MATLAB, where we determined through a protein vs. time plot that E.coli would yield a higher concentration of protein S than Sf9 cells.

For our next model, we wanted to build on what we learned in constructing our first model, to alter DNA concentration. We were still able to use our same system of ODE’s while using a For loop in MATLAB to simulate the protein production output based on our initial DNA concentration of 0.00077042 mM and multiplied that by 1, 2, 50, 100, 200, 500, 1000, 5000. Following extraction of the steady states for protein production, we calculated the mean of the last five points and plotted the results. We saw a linear increase in both E.coli and Sf9 protein levels as we increased initial DNA concentration. After working on this simulation, we decided to begin working on modeling gene regulatory networks for SF9 and E.coli.

For SF9 cells, we modeled the interaction between the polyhedrin promoter binding protein (PPBP) and the polyhedrin promoter showing its effect on the rate of production of mRNA and the resulting protein S. After conducting an extensive literature search, we were unable to locate established values for the concentration/number of molecules of the PPBP and therefore had to estimate the concentration based on common transcription factors found in Drosophila Melanogaster. To model the regulatory gene circuit in SF9 cells, we utilized the Hill function for an activator as follows:

The parameter estimation for this particular model was difficult and we were unable to identify the Michaelis Menten constant for the PPBP-polyhedrin promoter interaction and had to use the dissociation constant of the PPBP-Polyhedrin complex that we located in literature. Using the following set of ODE’s:

we were able to determine the expected levels of protein S expressed from the regulated Sf9 circuit upon analysis of the steady state behavior in this system. The predicted level of protein S from this circuit was lower than expected. We suspect parameter estimation errors for the transcription factor PPBP concentration and the polyhedrin promoter in the infected cell. Further studies on the steady state concentrations and dynamics of the PPBP and polyhedrin promoter interaction will provide more accurate values leading to a more precise mathematical model.

Following this, we sought to model the E.coli Lac Operon inducible gene network using the repressor Hill function, h- = α(1/1+(x/K)n). Specifically, we modeled the T7 promoter regulatory gene circuit which uses a repressor protein for transcriptional control. IPTG, or Isopropyl β-D-1-thiogalactopyranoside, resembles allolactose and binds the lac repressor protein. This binding allows for the T7 RNA polymerase to bind to the promoter and begin transcription of the downstream gene sequence. We wrote the following ODE’s to model this network:

Our simulated plots in MATLAB revealed that there was an increase in mRNA levels and corresponding protein levels following induction using IPTG. This particular model was an invaluable resource for our wet lab design. We were able to gain a better understanding of the circuit kinetics of this gene network. We used 0.5 mM IPTG, which was the initial steady state concentration for this model, to induce E.coli origami B cells which ultimately led to successful expression of protein S.

Modeling the network dynamics of complex biological systems requires knowledge of the molecular interactions for the most accurate simulation. With our stochastic model, constructing the molecular level of our cells was necessary to mimic the intricacies of protein S expression from cell concentration. The Gillespie algorithm allows us to model the details at the molecular level to determine the probability of a system, or a cell in our case, occupying a particular state in time. We use this to predict the variability of the state distribution over time, using the appropriate timescale (Cole, 2020). After initializing our cells, determining the timestep, and using a for loop to update our cells, we decided that the most comprehensive way to present the results of our model would be through a histogram that displays the protein concentration from cell counts as a movie frame over time.

While using a stochastic model is useful in simulating the molecular details in a system, there are limitations as well. One major restraint was the computational processing power that was required to complete the model. We were forced to adjust the time step to allow the model to run in a sufficient amount of time, and because of this the accuracy of cell count, DNA, mRNA, and protein amounts may have been skewed. Ultimately, using this stochastic model to project the probability distributions of our system to capture the complexities of protein S expression in MATLAB gave us the foundation that we needed for our final model which will be used to scale up production of protein S on an industrial level.

The initial aim of our project was to successfully express protein S in E.coli and Sf9 cells. Unfortunately, due to time constraints, we were unable to continue working in Sf9 cells and thought it would be most beneficial, based on our mathematical models, to focus on E.coli cells. Even though we were not able to continue working with SF9 cells in wet lab, in terms of industrial production, they proved to be an integral component of our dry lab. In order to determine the optimal conditions for the fermentation process at an industrial scale, our goal was to predict the inherent variability in baculovirus growth and viability to increase protein S production.

Our industrial scaled model was adapted from a dissertation submitted to the Indian Institute of Technology Hyderabad: “Mathematical modelling of baculovirus infection process: Kinetic parameter estimation,” by Suraj Kumar Singh (Singh and Giri 2018). This model integrates the dynamics of infected and uninfected cells, standard virus, defective interfering particles (DIPs), DNA, RNA, protein expression and polyhedral production, while accounting for the kinetics of DNA, RNA, proteins, and DIP formation.

This model has nine essential components.

  1. Host cells (SF9)
  2. The standard, wild-type virus
  3. Defective Interfering Particles (DIP’s)
  4. Oxygen
  5. Carbon Dioxide
  6. Substrate
  7. DNA
  8. mRNA
  9. Dead Cells

We used a set of coupled ordinary differential equations initial value problems (ODE-IVPs) to represent the baculovirus infection process, modeling cell density, percent polyhedra per infected cell, and cell viability. We used kinetic parameters such as specific growth rate, death rate, rate of oxygen consumption, the rate of carbon dioxide formation, the maintenance coefficient, product degradation rate, rate of DNA replication, rate of RNA formation, and rate of protein formation. We ran the simulation in MATLAB three times using three different sets of initial conditions, each data set had a different amount of initial uninfected cells to establish cell growth patterns. Modeling cell viability, we were able to conclude the amount of recombinant protein that would be produced because that number is directly proportional to cell viability in baculovirus expression systems. This modeling gave us insight into how to optimize the baculovirus infection process to scale up our protein production to an industrial level for the entrepreneurial aspect of our project, and make our proposed product more cost-effective and widely available.


Final Thoughts

All our models can be greatly improved upon to be more accurate and provide even more comprehensive information about our project. In the future, we hope to refine our simulations and methodologically address the weaknesses in some aspects of our modeling, which were defined above. However, despite these drawbacks, our modeling did prove effective towards the goals of our project. The work outlined on this page gave us various insights into our project in terms of both experimental design and plans for future research. The results of our simulations helped us decide which cells were more feasible to express in (particularly over a shorter period of time), what concentration of IPTG to use in order to induce proper expression, and how to scale up our project to an industrial scale. Without the iterative feedback from our modeling, which was thoroughly incorporated into our wet lab experimental design, we would not have been able to achieve everything we accomplished over the course of the past year. Modeling was fundamental to the success of our project, and we hope future iGEM teams can build on what we’ve done.

To this extent, and in order to make our work transparent and easily accessible to everyone, we have attached the MATLAB code for each of our models in the links below. These scripts, unless otherwise indicated, were written by Lori Saxena, Stephanie Laderwager, and Maulik Masaliya from the 2022 Stony_Brook iGEM team.


References

Aronson, J. K., & Ferner, R. E. (2016). The law of mass action and the pharmacological concentration-effect curve: resolving the paradox of apparently non-dose-related adverse drug reactions. British journal of clinical pharmacology, 81(1), 56–61. https://doi.org/10.1111/bcp.12706.

Beadle, L. F., Love, J. C., Shapovalova, Y., Artemev, A., Rattray, M., & Ashe, H. L. (2022). Modeling global mrna dynamics during drosophila embryogenesis reveals a relationship between mrna degradation and P-Bodies. BioRxiv. https://doi.org/10.1101/2022.03.17.484585.

Bloch, E. D. (2011). Limits to Infinity. In The Real Numbers and Real Analysis (pp. 321–355). New York, NY: Springer New York.

Burma S, Mukherjee B, Jain A, Habib S, Hasnain SE. An unusual 30-kDa protein binding to the polyhedrin gene promoter of Autographa californica nuclear polyhedrosis virus. J Biol Chem. 1994 Jan 28;269(4):2750-7. PMID: 8300607.

Chang, J. C., Lee, S. J., Kim, J. S., Wang, C. H., & Nai, Y. S. (2018). Transient Expression of Foreign Genes in Insect Cells (sf9) for Protein Functional Assay. Journal of visualized experiments : JoVE, (132), 56693. https://doi.org/10.3791/56693

Demain, A. L., & Vaishnav, P. (2009). Production of recombinant proteins by microbes and higher organisms. Biotechnology advances, 27(3), 297–306. https://doi.org/10.1016/j.biotechadv.2009.01.008

Fukaya, T., Lim, B., & Levine, M. (2017). Rapid Rates of Pol II Elongation in the Drosophila Embryo. Current biology : CB, 27(9), 1387–1391. https://doi.org/10.1016/j.cub.2017.03.069

García, L. R., & Molineux, I. J. (1995). Rate of translocation of bacteriophage T7 DNA across the membranes of Escherichia coli. Journal of bacteriology, 177(14), 4066–4076. https://doi.org/10.1128/jb.177.14.4066-4076.1995.

Ghosh, S., Jain, A., Mukherjee, B., Habib, S., & Hasnain, S. E. (1998). The host factor polyhedrin promoter binding protein (PPBP) is involved in transcription from the baculovirus polyhedrin gene promoter. Journal of virology, 72(9), 7484–7493. https://doi.org/10.1128/JVI.72.9.7484-7493.1998

Gillespie, D. T. (1977). Exact stochastic simulation of coupled chemical reactions. The Journal of Physical Chemistry, 81(25), 2340–2361. https://doi.org/10.1021/j100540a008

(2020), L., Cole. (2020, April 14). Gillespie Algorithm. Retrieved September 25, 2022, from Lewis Cole Blog website: https://lewiscoleblog.com/gillespie-algorithm

Gillespie_algorithm. (n.d.). Retrieved August 23, 2022, from Chemeurope.com website: https://www.chemeurope.com/en/encyclopedia/Gillespie_algorithm.html

Guet, C. C., Bruneaux, L., Min, T. L., Siegal-Gaskins, D., Figueroa, I., Emonet, T., & Cluzel, P. (2008). Minimally invasive determination of mRNA concentration in single living bacteria. Nucleic acids research, 36(12), e73. https://doi.org/10.1093/nar/gkn329.

Hausser, J., Mayo, A., Keren, L., & Alon, U. (2019). Central dogma rates and the trade-off between precision and economy in gene expression. Nature Communications, 10(1), 68. doi:10.1038/s41467-018-07391-8

He, L., Binari, R., Huang, J., Falo-Sanjuan, J., & Perrimon, N. (2019). In vivo study of gene expression with an enhanced dual-color fluorescent transcriptional timer. eLife, 8, e46181. https://doi.org/10.7554/eLife.46181.

Hockley, M. (2021, October 7). Why is the time-step so important? Retrieved September 5, 2022, from NanoMedic website: https://nanomedic.org/why-is-the-time-step-so-important/

Hofmann, C., Sandig, V., Jennings, G., Rudolph, M., Schlag, P., & Strauss, M. (1995). Efficient gene transfer into human hepatocytes by baculovirus vectors. Proceedings of the National Academy of Sciences of the United States of America, 92(22), 10099–10103. https://doi.org/10.1073/pnas.92.22.10099

Jackson, D. A., Pombo, A., & Iborra, F. (2000). The balance sheet for transcription: an analysis of nuclear RNA metabolism in mammalian cells. FASEB journal : official publication of the Federation of American Societies for Experimental Biology, 14(2), 242–254.

Kirkwood, T. B., and C. R. Bangham. "Cycles, chaos, and evolution in virus cultures: a model of defective interfering particles." Proceedings of the National Academy of Sciences91.18 (1994): 8685-8689.

Li, G. W., & Xie, X. S. (2011). Central dogma at the single-molecule level in living cells. Nature, 475(7356), 308–315. https://doi.org/10.1038/nature10315.

Luby-Phelps K. Cytoarchitecture and physical properties of cytoplasm: volume, viscosity, diffusion, intracellular surface area. Int Rev Cytol. 2000;192:189-221. doi: 10.1016/s0074-7696(08)60527-6. PMID: 10553280.

Maurizi M. R. (1992). Proteases and protein degradation in Escherichia coli. Experientia, 48(2), 178–201. https://doi.org/10.1007/BF01923511

Olofsson, S. O., Boström, K., Carlsson, P., Borén, J., Wettesten, M., Bjursell, G., Wiklund, O., & Bondjers, G. (1987). Structure and biosynthesis of apolipoprotein B. American heart journal, 113(2 Pt 2), 446–452. https://doi.org/10.1016/0002-8703(87)90612-0.

Owen, M. (n.d.). Gstdmb 2012: Dynamical modeling for biology and medicine. Retrieved August 4, 2022, from Nottingham.ac.uk website: https://www.maths.nottingham.ac.uk/minisites/mathsforlife/gstdmb2012/lect2_ModellingWithDifferentialEquations.pdf

Putman, Z. (2020, November 15). Evaluating modified polyhedrin promoters for improved co-expression of KRAS4b. Maryland Shared Open Access Repository Home. Retrieved August 2, 2022, from https://mdsoar.org/handle/11603/20062

Shao-Hua, C., Hong-Liang, S., & Zuo-Hu, L. (1998). Effect of temperature oscillation on insect cell growth and baculovirus replication. Applied and environmental microbiology, 64(6), 2237–2239. https://doi.org/10.1128/AEM.64.6.2237-2239.1998

Singh, S. K., & Giri, L. (2018). Mathematical modelling of baculovirus infection process: Kinetic parameter estimation (Doctoral dissertation, Indian Institute of Technology Hyderabad).

Steele, K. H., Stone, B. J., Franklin, K. M., Fath-Goodin, A., Zhang, X., Jiang, H., Webb, B. A., & Geisler, C. (2017). Improving the baculovirus expression vector system with vankyrin-enhanced technology. Biotechnology progress, 33(6), 1496–1507. https://doi.org/10.1002/btpr.2516

Stochastic modeling - math terms & solutions - maplesoft. (n.d.). Retrieved August 9, 2022, from Maplesoft.com website: https://www.maplesoft.com/ns/math/stochastic-modeling.aspx

Taniguchi, Y., Choi, P. J., Li, G. W., Chen, H., Babu, M., Hearn, J., Emili, A., & Xie, X. S. (2010). Quantifying E. coli proteome and transcriptome with single-molecule sensitivity in single cells. Science (New York, N.Y.), 329(5991), 533–538. https://doi.org/10.1126/science.1188308.

Tomlin, C., Axelrod, J. Biology by numbers: mathematical modeling in developmental biology. Nat Rev Genet 8, 331–340 (2007). https://doi.org/10.1038/nrg2098

Vogel, C., & Marcotte, E. M. (2012). Insights into the regulation of protein abundance from proteomic and transcriptomic analyses. Nature reviews. Genetics, 13(4), 227–232. https://doi.org/10.1038/nrg3185

Walter, J., Dever, C. A., & Biggin, M. D. (1994). Two homeo domain proteins bind with similar specificity to a wide range of DNA sites in Drosophila embryos. Genes & development, 8(14), 1678–1692. https://doi.org/10.1101/gad.8.14.1678

Xu, J., & Matthews, K. S. (2009). Flexibility in the inducer binding region is crucial for allostery in the Escherichia coli lactose repressor. Biochemistry, 48(22), 4988–4998. https://doi.org/10.1021/bi9002343

Zhang, Y., Enden, G., Wei, W., Zhou, F., Chen, J., & Merchuk, J. C. (2020). Baculovirus transit through insect cell membranes: A mechanistic approach. Chemical engineering science, 223, 115727. https://doi.org/10.1016/j.ces.2020.115727

Zitzmann, J., Sprick, G., Weidner, T., & Czermak, C. S. P. (2017). Process Optimization for Recombinant Protein Expression in Insect Cells. In (Ed.), New Insights into Cell Culture Technology. IntechOpen. https://doi.org/10.5772/67849

Links to COVID-19

Coronavirus disease (COVID-19) is an infectious disease caused by the SARS-CoV-2 virus. This disorder is widely characterized as a respiratory disease, however this is not necessarily accurate. COVID-19 patients experience a variety of symptoms including pneumonia, inflammation, micro-vasculature dysfunction, hyper-coagulation, nervous system damage and multi-organ failure. Moreover, there are multiple long-term complications, which are still not well-characterized.

Any of these symptoms can be responsible for COVID-19 related mortality. However, thrombosis (abnormal blood-clotting) is the leading cause of death in severe COVID-19 infections. Researchers have linked these abnormal blood-clotting events in COVID-19 with a decline in protein S levels.

Protein S is a multifunctional protein which maintains normal and healthy blood clotting activity, and prevents overcoagulation of the blood. It also inhibits inflammation following cell death and infection, and is used in many cell-signaling pathways. Because protein S has so many different roles, protein S deficiency can have grave consequences.

There are multiple causes for the decline in protein S levels following a COVID-19 infection. In severe COVID-19 infections, there is an extreme immune response, known as thrombo-inflammation, which occurs by overproduction of cytokines, small molecules which regulate the activity of cells.

A decline in protein S levels during COVID-19 can also be explained by the structural similarity between protein S and the spike protein found on the surface of the virus. This spike protein is recognized by the immune system. Following an infection, the immune system produces antibodies against the spike protein to neutralize the virus, but these antibodies can attack the structurally similar parts of protein S, leading to a decline in its levels.

Overall, severe COVID-19 infections have been shown to cause a decline in protein S levels. However, research has also shown that in patients who already have genetic protein S deficiency prior to any infection, the effects are much worse. In a COVID-19 case report, preliminary protein S deficiency has been shown to cause ischemic stroke.

Essentially, it can be concluded that severe COVID-19 infections cause a significant decrease in protein S levels, and increase the incidence of abnormal or fatal blood clotting and inflammation. Researchers studying the direct correlation between protein S deficiency and COVID-19 suggest that administration of protein S in severe COVID-19 patients can be an effective therapy, and can be used as an alternative to traditional treatment. Administering protein S can also be used as a preventative therapeutic for patients with protein S deficiency in order to prevent the occurrence of stroke.

Overall, administering protein S can potentially be an effective treatment for severe COVID-19, and there are many links between protein S and the viral disease. More research should be done to learn about this connection and explore protein S as a therapeutic treatment for severe COVID-19 infections.


Our Approach

Because our project aims to administer protein S directly into the bloodstream, we sought to elucidate possible risks to patients. After reading about the relations between COVID-19 and protein S deficiency, as well as the hypotheses that there might be structural similarities between the spike protein on the virus and protein S, we wanted to investigate the risk of a possible immune response in the administration of protein S. To this extent, we collaborated closely with the iGEM EmpireGene team to model the structures of these two molecules and determine their similarity. We would like to thank the EmpireGene team for their invaluable contributions to this aspect of our project!


Results

The following results were collected and written up by the EmpireGene Team and sent to us in the following document; we simply recopied them below.

Structure of SAR-CoV2

The virion is 60–140 nanometres (2.4×10−6–5.5×10−6 in) in diameter; its mass within the global human populace has been estimated as being between 0.1 and 10 kilograms. Like other coronaviruses, SARS-CoV-2 has four structural proteins, known as the S (spike), E (envelope), M (membrane), and N (nucleocapsid) proteins; the N protein holds the RNA genome, and the S, E, and M proteins together create the viral envelope. Coronavirus S proteins are glycoproteins and also type I membrane proteins (membranes containing a single transmembrane domain oriented on the extracellular side). They are divided into two functional parts (S1 and S2). In SARS-CoV-2, the spike protein, which has been imaged at the atomic level using cryogenic electron microscopy, is the protein responsible for allowing the virus to attach to and fuse with the host cell’s membrane; specifically, its S1 subunit catalyzes attachment, the S2 subunit fusion.


Sequence and Structural Similarity Between COVID-19 and Protein S

Protein S = 1z6c

COVID-19 = 1P9U

T-coffee alignment showed no sequence alignment between two structures. There was no structural similarity as well. The NMR structure of protein S contains loops in its structure along with beta sheets whereas Coronavirus Main Proteinase (3CLpro) Structure contains both secondary subunits.

Overall Results: no structural similarity between proteins and COVID-19.


Conclusion

Overall, there was no structural similarity between protein S and COVID-19. First, the structure of the Sar-CoV2, was selected and superimposed its structure on the NMR structure of protein S. It was found that there were no common residues. Then, the sequences were compared and it was found that there were also no similarities between them. Looking at the literature, a hypoxia environment that arises during the osnet of severe COVID-19 might cause mutations or deactivate protein S, causing structural changes. Furthermore, structural components of the two molecules may have no bearing on the link between COVID-19 and protein S deficiency, and there may be any number of other mechanisms leading to a decline in protein S levels during the disorder. This reduces possible risk of an immune response in the administration of protein S directly into the human bloodstream.


References

Ali, L., Jamoussi, H., Kouki, N., Fray, S., Echebbi, S., Ben Ali, N., & Fredj, M. (2021). COVID-19 Infection and Recurrent Stroke in Young Patients With Protein S Deficiency: A Case Report. The neurologist, 26(6), 276–280. https://doi.org/10.1097/NRL.0000000000000367

Faiez, A. . (2020). Role of S protein in thromboembolic complications during COVID19 and activated protein C as a serious therapeutic avenue in severe forms of patients. Journal of Applied and Natural Science, 12(2), 88–90. https://doi.org/10.31018/jans.vi.2251

Chatterjee, S., Sengupta, T., Majumder, S., & Majumder, R. (2020). COVID-19: a probable role of the anticoagulant Protein S in managing COVID-19-associated coagulopathy. Aging, 12(16), 15954–15961. https://doi.org/10.18632/aging.103869

Pilli, V. S., Plautz, W., & Majumder, R. (2016). The Journey of Protein S from an Anticoagulant to a Signaling Molecule. JSM biochemistry and molecular biology, 3(1), 1014.

Background

Venous thromboembolism (VTE), consists of DVT and/or PE, and has an annual incidence of 300,000 to 900,000 cases in the United States every year, making it a significant cause of mortality and morbidity (Beckman et al. 2010, Raskob et al. 2010). African Americans have a 30–60% higher incidence of VTE than individuals of European descent (Roberts et al. 2009, Zakai and McClure 2011). However, the typical genetic risk factors in populations of European descent are nearly absent in African Americans, and population-specific genetic factors influencing the higher VTE rate are not well characterized at all.

In African American populations, the variant PROS1 V510M, which also causes protein S deficiency and is associated with increased VTE, has a population prevalence from 0.5% to 1.42% among African Americans and African descent populations, and is virtually absent in other populations (Daneshjou et al. 2016).

Despite this, genetic risk factors for this disorder, which have been thoroughly studied in European populations, have not been well characterized in African American communities or other populations (Daneshjou et al. 2016). African American representation in clinical studies is necessary to gain a more comprehensive understanding of how VTE and protein S deficiency affect this community.

We sought to help increase the genetic characterization and understanding of this disorder in African American communities. After reaching out to our local African American communities, and hearing about how widely and deeply impacted they were by issues such as protein S deficiency and VTE, we wanted to take a step forward in research that was designed to address their needs. We decided to specifically model the PROS1 V510M mutation, a mutation that causes protein S deficiency and is found primarily in African American populations. This serves as a foundation for additional characterization of this genetic variant and a starting point for studies that better address the needs of the African American population.


Our Approach

After a thorough literature search, we learned more about PROS1 V510M, and that this variant changes a valine to methionine at position 510 of the PROS1 gene (Daneshjou et al. 2016) and is predicted to be “damaging” and “possibly damaging” (Kumar et al. 2009, Adzhubei et al. 2010). In order to better understand this variant, we sought to predict the difference in thermodynamic stability between the protein variants. This is crucial for protein design and better understanding of genotype-phenotype relationships (Pancotti et al. 2022).

In order to better understand this variant, we sought to predict the difference in thermodynamic stability between the protein variants. This is crucial for protein design and better understanding of genotype-phenotype relationships (Pancotti et al. 2022).

The problem of predicting protein stability changes upon variation of a single residue is not simple, and the effectiveness of this method is affected by several experimental limitations and computational issues (Potapov et al. 2009). Understanding the impact of mutations which lead to the disruption of protein function is fundamental for describing the molecular mechanisms of various different human diseases (Hartl 2017). In fact, specific protein stability perturbations have already been associated with specific pathogenic missense variants (Compiana and Capriotti 2013). Variations have been shown to contribute to the loss of function in certain genes (Birolo et al. 2021) and to modulate drug resistance in several different diseases (Pires et al. 2016). In this context, tools that can predict the effects of mutations on protein stability are crucial in order to infer their pathogenicity.

The effects of non-synonymous variants on protein stability are quantified in terms of the Gibbs free energy of unfolding (⁠ΔG⁠). The stability change from a mutated (⁠M⁠) protein to its wild-type (⁠W⁠) form is defined as the difference of the corresponding unfolding free energy ΔΔG of two proteins:

ΔΔGMW = ΔGM - ΔGW

This value is commonly measured in kcal/mol and the sign indicates whether the variation decreases (destabilizing) or increases (stabilizing) the protein stability.

So far, several computational tools have been created to address this task. We decided to use three different prediction methods to gauge changes in thermodynamic stability:

  1. Dynamut and Dynamut2: machine learning methods implementing a consensus prediction. They combine the effects of mutations on protein stability and dynamics calculated by DUET, Bio3D and ENCoM to generate an optimized and more robust predictor (Rodrigues et al. 2021).
  2. SAAFEC-SEQ: gradient boosting decision-tree machine learning method that uses physico-chemical properties, sequence features and evolutionary information to predict the ΔΔG values (Li et al. 2021).
  3. MUpro: sequence-based SVM-based approach which considers the local mutation environment encoding the residues in a window centered on the target residue. The input corresponding to the deleted residue is set to −1 and the newly introduced residue to 1; all other inputs are set to 0 (Cheng et al. 2006, Cheng et al. 2005).

The wildtype sequence, shown below, is the wildtype 676 protein sequence for the PROS1 gene. This sequence, and the mutated PROS1 V510M sequence, also shown below, were input into the servers.


Wildtype:

MRVLGGRCGALLACLLLVLPVSEANFLSKQQASQVLVRKRRANSLLEETKQGNLERECIEELCNKEEAREVFENDPETDYFYPKYLVCLRSFQT
GLFTAARQSTNAYPDLRSCVNAIPDQCSPLPCNEDGYMSCKDGKASFTCTCKPGWQGEKCEFDINECKDPSNINGGCSQICDNTPGSYHCSCKNGFVMLSN
KKDCKDVDECSLKPSICGTAVCKNIPGDFECECPEGYRYNLKSKSCEDIDECSENMCAQLCVNYPGGYTCYCDGKKGFKLAQDQKSCEVVSVCLPLN
LDTKYELLYLAEQFAGVVLYLKFRLPEISRFSAEFDFRTYDSEGVILYAESIDHSAWLLIALRGGKIEVQLKNEHTSKITTGGDVINNGLWNMVSVEELEHS
ISIKIAKEAVMDINKPGPLFKPENGLLETKVYFAGFPRKVESELIKPINPRLDGCIRSWNLMKQGASGIKEIIQEKQNKHCLVTVEKGSYYPGSGIAQFHIDY
NNVSSAEGWHVNVTLNIRPSTGTGVMLALVSGNNTVPFAVSLVDSTSEKSQDILLSVENTVIYRIQALSLCSDQQSHLEFRVNRNNLELSTPLKIETISHE
DLQRQLAVLDKAMKAKVATYLGGLPDVPFSATPVNAFYNGCMEVNINGVQLDLDEAISKHNDIRAHSCPSVWKKTKNS

Mutation:

MRVLGGRCGALLACLLLVLPVSEANFLSKQQASQVLVRKRRANSLLEETKQGNLERECIEELCNKEEAREVFENDPETDYFYPKYLVCLRSFQT
GLFTAARQSTNAYPDLRSCVNAIPDQCSPLPCNEDGYMSCKDGKASFTCTCKPGWQGEKCEFDINECKDPSNINGGCSQICDNTPGSYHCSCKNGFVMLSN
KKDCKDVDECSLKPSICGTAVCKNIPGDFECECPEGYRYNLKSKSCEDIDECSENMCAQLCVNYPGGYTCYCDGKKGFKLAQDQKSCEVVSVCLPLN
LDTKYELLYLAEQFAGVVLYLKFRLPEISRFSAEFDFRTYDSEGVILYAESIDHSAWLLIALRGGKIEVQLKNEHTSKITTGGDVINNGLWNMVSVEELEHS
ISIKIAKEAVMDINKPGPLFKPENGLLETKVYFAGFPRKVESELIKPINPRLDGCIRSWNLMKQGASGIKEIIQEKQNKHCLVTVEKGSYYPGSGIAQFHIDY
NNVSSAEGWHVNMTLNIRPSTGTGVMLALVSGNNTVPFAVSLVDSTSEKSQDILLSVENTVIYRIQALSLCSDQQSHLEFRVNRNNLELSTPLKIETISHE
DLQRQLAVLDKAMKAKVATYLGGLPDVPFSATPVNAFYNGCMEVNINGVQLDLDEAISKHNDIRAHSCPSVWKKTKNS


Obtained Models and Results

DynaMut and DynaMut2

We used DynaMut, a web server which assesses the impact of mutations on protein dynamics and stability resulting from vibrational entropy changes. The figure below shows the DynaMut methodology workflow (Taken from Rodrigues et al. 2018).

The DynaMut methodology can be divided into four steps. In step 1, data was collected from the previously established S2648 subset of mutations with experimental evidence from ProTherm. In step 2, DynaMu combines the effects of mutations on protein stability and dynamics calculated by Bio3D, ENCoM and DUET. In addition, DynaMut also includes a set of complementary information regarding the environment characteristics of the wild-type residue (e.g. relative solvent accessibility, residue depth and secondary structure) and the graph-based signatures generated by mCSM. All these features are used as evidence for training supervised learning algorithms in step 3. After evaluating the performance of the predictive model, the consensus prediction is integrated into the DynaMut web server (Rodrigues et al. 2018).

Results from DynaMut:

http://biosig.unimelb.edu.au/dynamut/results_prediction/16574752080  
ΔΔG Effect
ΔΔG: -0.020 kcal/mol Destabilizing
ΔΔG ENCoM: 0.116 kcal/mol Destabilizing
ΔΔG mCSM: -0.751 kcal/mol Destabilizing
ΔΔG SDM: -2.390 kcal/mol Destabilizing
ΔΔG DUET: -1.145 kcal/mol Destabilizing

ΔVibrational Entropy Energy Between Wild-Type and Mutant

ΔΔSvib ENCoM: -0.144 kcal.mol-1.K-1 (decrease of molecule flexibility).

This is the visual representation of the ΔEntropy Energy in which the amino acids are coloured according to the vibrational entropy change upon mutation. Blue regions indicate rigidification and red, a gain in flexibility.

Prediction of Interatomic Interactions

 
Wild-Type Mutant

Interatomic interaction predictions of wild-type and mutant residues on the output page of DynaMut. Wild-type and mutant residues are colored in light-green and are also represented as sticks. A table with the color definitions for each type of interaction is shown on top. In the mutant, there are more hydrophobic and ionic interactions than in the wildtype.

Ensemble of Wild-type and Mutant

Wild-type and mutant sequences were also extracted from their respective 3D structures and then aligned. The results for each of the sequences are displayed below.

Type of secondary structure on each region of the sequence is added to the top and bottom margins of the plot (helices black and strands gray).

Visual Analysis of Atomic Fluctuation

Atomic fluctuation provides the amplitude of the absolute atomic motion. The database performed the calculations over the first 10 non-trivial, or normal modes of the molecule.

 
Wild-Type Mutant

The magnitude of the fluctuation is represented by thin to thick tube colored, blue (low), white (moderate) and red (high). Atomic fluctuation is an important feature which underlies protein function, giving insight into atom positions.


Visual Analysis of Deformation Energies

Deformation energy provides a measure for the amount of local flexibility in the protein. The calculations were performed over the first 10 non-trivial modes of the molecule.

 
Wild-Type Mutant

The magnitude of the deformation is represented by thin to thick tube colored, blue (low), white (moderate) and red (high).

Results from DynaMut2:

http://biosig.unimelb.edu.au/dynamut2/results_prediction/165742540063

Predicted Stability Change (ΔΔGStability)

 
ΔΔG Effect
ΔΔG: -0.17 kcal/mol Destabilizing

Prediction of Interatomic Interactions

 
Wild-Type Mutant

This diagram, although slightly different, also demonstrates an increase in hydrophobic interactions in the mutant.


SAAFEC-SEQ

SAAFEC-SEQ uses the PsePSSM algorithm to predict the stability changes upon a single mutation in a protein. SAAFEC-SEQ uses gradient boosting decision tree repressor to encode physicochemical properties, sequence features and evolutionary information to compute the change in stability free energy resulting from single mutations. SAAFEC-SEQ outperforms all existing methods in both the Pearson correlation coefficient and root-mean-squared-error parameters for several independent datasets (Li et al. 2021). The figure below shows the SAAFEC-SEQ workflow (taken from Li et al. 2021).

Results from SAAFEC-SEQ:

http://compbio.clemson.edu/SAAFEC-SEQ/results.php?tid=16574248874841&rid=32303  
ΔΔG Effect
ΔΔG: -0.87 kcal/mol Destabilizing

The value of ΔΔG varies significantly from the DynaMut results.


MUpro

MUpro is a set of machine learning programs to predict how a single-site amino acid mutation affects protein stability. It involves two machine learning methods: Support Vector Machines and Neural Networks. Both of them were trained on a large mutation dataset and show accuracy above 84% via 20 fold cross validation. One advantage of this method is that it does not require tertiary structures to predict protein stability changes. Additionally, the prediction accuracy using sequence information alone is comparable to that of using tertiary structures. This is beneficial, because the full protein S tertiary structure is unavailable, but we could still use this server to get accurate predictions (Cheng et al. 2006, Cheng et al. 2005).

Results from MUpro

 
ΔΔG Effect
ΔΔG: -0.84112336 kcal/mol Destabilizing

Both Support Vector Machine and Neural Network confirm that there is a decrease in the stability of the protein structure, with confidence scores of -0.19358089 and -0.513645616184482, respectively. The values differed significantly from the DynaMut results, but were very similar to the results from SAAFEC-SEQ.


Conclusion

The values of ΔΔG varied somewhat between the three different servers. However, all three servers agreed that the mutation had a DESTABILIZING effect on the protein. This destabilizing effect makes the protein more vulnerable to degradation or aggregation. The energy of an unfolded protein is nearly identical with a single point mutation. There was also an increase in hydrophobic interactions between amino acid side-chains, demonstrating possible increased compactness of the molecule, thereby reducing its ability to function normally.


References

Adzhubei, I. A. , Schmidt S., Peshkin L., Ramensky V. E., Gerasimova A., Bork P., et al. 2010. A method and server for predicting damaging missense mutations. Nat. Methods 7:248–249.

Birolo G, Benevenuta S, Fariselli P, et al. Protein stability perturbation contributes to the loss of function in haploinsufficient genes. Front Mol Biosci 2021;8:10.

Carlos HM Rodrigues, Douglas EV Pires, David B Ascher, DynaMut: predicting the impact of mutations on protein conformation, flexibility and stability, Nucleic Acids Research, Volume 46, Issue W1, 2 July 2018, Pages W350–W355

Free energy of mutation calculations in Cyrus Bench. (n.d.). Retrieved August 16, 2022, from https://cyrusbio.com/wp-content/uploads/Rosetta-cartesian-DDG-2019.pdf.

J. Cheng, A. Z. Randall, M. Sweredoski, and P. Baldi. SCRATCH: a Protein Structure and Structural Feature Prediction Server. Nucleic Acids Research, vol. 33 (server issue), w72-76, 2005

Cheng J, Randall A, Baldi P. Prediction of protein stability changes for single-site mutations using support vector machines. Proteins: Structure, Function, and Bioinformatics 2006;62(4):1125–32.

Compiani, M., & Capriotti, E. (2013). Computational and theoretical methods for protein folding. Biochemistry, 52(48), 8601–8624. https://doi.org/10.1021/bi4001529

Corrado Pancotti, Silvia Benevenuta, Giovanni Birolo, Virginia Alberini, Valeria Repetto, Tiziana Sanavia, Emidio Capriotti, Piero Fariselli, Predicting protein stability changes upon single-point mutation: a thorough comparison of the available tools on a new dataset, Briefings in Bioinformatics, Volume 23, Issue 2, March 2022, bbab555, https://doi.org/10.1093/bib/bbab555

Daneshjou, R., Cavallari, L. H., Weeke, P. E., Karczewski, K. J., Drozda, K., Perera, M. A., Johnson, J. A., Klein, T. E., Bustamante, C. D., Roden, D. M., Shaffer, C., Denny, J. C., Zehnder, J. L., & Altman, R. B. (2016). Population-specific single-nucleotide polymorphism confers increased risk of venous thromboembolism in African Americans. Molecular genetics & genomic medicine, 4(5), 513–520. https://doi.org/10.1002/mgg3.226

Hartl F. U. (2017). Protein Misfolding Diseases. Annual review of biochemistry, 86, 21–26. https://doi.org/10.1146/annurev-biochem-061516-044518

Hou Q, Pucci F, Ancien F, et al. SWOTein: a structure-based approach to predict stability strengths and weaknesses of prOTEINs. Bioinformatics 2021;37(14):1963–71.

Kumar, P. , Henikoff S., and Ng P. C.. 2009. Predicting the effects of coding non‐synonymous variants on protein function using the SIFT algorithm. Nat. Protoc. 4:1073–1081.

Li G, Panday SK, Alexov E. SAAFEC-SEQ: a sequence-based method for predicting the effect of single point mutations on protein thermodynamic stability. Int J Mol Sci 2021;22(2):606.

Lijun Quan, Qiang Lv, and Yang Zhang, STRUM: Structure-based stability change prediction upon single-point mutation, Bioinformatics, Submitted (2015)

Pires, D. E., Chen, J., Blundell, T. L., & Ascher, D. B. (2016). In silico functional dissection of saturation mutagenesis: Interpreting the relationship between phenotypes and changes in protein stability, interactions and activity. Scientific reports, 6, 19848. https://doi.org/10.1038/srep19848

Potapov, V., Cohen, M., & Schreiber, G. (2009). Assessing computational methods for predicting protein stability upon mutation: good on average but not in the details. Protein engineering, design & selection : PEDS, 22(9), 553–560. https://doi.org/10.1093/protein/gzp030

Rodrigues, C., Pires, D., & Ascher, D. B. (2021). DynaMut2: Assessing changes in stability and flexibility upon single and multiple point missense mutations. Protein science : a publication of the Protein Society, 30(1), 60–69. https://doi.org/10.1002/pro.3942

Protein Modeling

Bioinformatics tools can be used to model a protein structure if the amino-acid sequence of the protein is known. Computational protein structure prediction relies on principles obtained through techniques including X-ray crystallography, NMR spectroscopy and other physical energy functions to predict, with a certain level of accuracy, the three-dimensional structures of proteins. These methods use various Machine Learning algorithms to develop and predict comprehensive protein structures. We decided to use three methods for modeling our protein, protein S, for which there is not a structure available.

1. Homology Modeling

a. This is also known as comparative modeling of protein, and is used when you have a protein with an unknown structure, and a similar structurally known protein. The known protein structure can be used as a template to predict the structure of the unknown protein. First, a primary BLAST search is performed in the Protein Data Bank (PDB), using the protein sequence, to find the template protein that best resembles the unknown protein. Once the matching template is found, several tools can be used to model the unknown structure. We used ChimeraX to predict our protein structure.

2. Threading/Fold Recognition

a. Protein threading, also known as fold recognition, is used to model proteins which have the same fold as structurally known proteins, but do not have homologous proteins with known structures.

3. Ab-Initio Method

a. When there is not a known structure of a similar protein, this method can be used to determine the tertiary structure of a protein. This method conducts a conformational search using a designed energy function and generates a number of possible conformations. From these, final models can be selected.


Protein Sequence

The Protein S (PROS1) sequence comprises 676 amino acids. Through literature review, it was found that PROS1 is synthesized as a 676 amino acid precursor protein which is processed to a mature protein of 635 amino acids. The 41 amino acid difference between the two accounts for a signaling peptide that is necessary for the expression of the protein. A post-translational modification, specifically a simple singular peptide bond cleavage, results in the mature form of the protein that gets secreted. Therefore, we chose to model both the pre-cleaved PROS1 (676 AA), and the truncated mature PROS1 (635 AA).

676 Precursor Protein Sequence

MRVLGGRCGALLACLLLVLPVSEANFLSKQQASQVLVRKRRANSLLEETKQGNLERECIEELCNKEEAREVFENDPETDYFYPKYLVCLRSFQTGLFTAA
RQSTNAYPDLRSCVNAIPDQCSPLPCNEDGYMSCKDGKASFTCTCKPGWQGEKCEFDINECKDPSNINGGCSQICDNTPGSYHCSCKNGFVMLSNKK
DCKDVDECSLKPSICGTAVCKNIPGDFECECPEGYRYNLKSKSCEDIDECSENMCAQLCVNYPGGYTCYCDGKKGFKLAQDQKSCEVVSVCLPLNLDT
KYELLYLAEQFAGVVLYLKFRLPEISRFSAEFDFRTYDSEGVILYAESIDHSAWLLIALRGGKIEVQLKNEHTSKITTGGDVINNGLWNMVSVEELEHSISIKI
AKEAVMDINKPGPLFKPENGLLETKVYFAGFPRKVESELIKPINPRLDGCIRSWNLMKQGASGIKEIIQEKQNKHCLVTVEKGSYYPGSGIAQFHIDYNNV
SSAEGWHVNVTLNIRPSTGTGVMLALVSGNNTVPFAVSLVDSTSEKSQDILLSVENTVIYRIQALSLCSDQQSHLEFRVNRNNLELSTPLKIETISHEDLQR
QLAVLDKAMKAKVATYLGGLPDVPFSATPVNAFYNGCMEVNINGVQLDLDEAISKHNDIRAHSCPSVWKKTKNS

635 Mature Protein Sequence

MRVLGGRCGALLACLLLVLPVSEANFLSKQQASQVLVRKRRANSLLEETKQGNLERECIEELCNKEEAREVFENDPETDYFYPKYLVCLRSFQTGLFTAA
RQSTNAYPDLRSCVNAIPDQCSPLPCNEDGYMSCKDGKASFTCTCKPGWQGEKCEFDINECKDPSNINGGCSQICDNTPGSYHCSCKNGFVMLSNKK
DCKDVDECSLKPSICGTAVCKNIPGDFECECPEGYRYNLKSKSCEDIDECSENMCAQLCVNYPGGYTCYCDGKKGFKLAQDQKSCEVVSVCLPLNLDT
KYELLYLAEQFAGVVLYLKFRLPEISRFSAEFDFRTYDSEGVILYAESIDHSAWLLIALRGGKIEVQLKNEHTSKITTGGDVINNGLWNMVSVEELEHSISIKI
AKEAVMDINKPGPLFKPENGLLETKVYFAGFPRKVESELIKPINPRLDGCIRSWNLMKQGASGIKEIIQEKQNKHCLVTVEKGSYYPGSGIAQFHIDYNNV
SSAEGWHVNVTLNIRPSTGTGVMLALVSGNNTVPFAVSLVDSTSEKSQDILLSVENTVIYRIQALSLCSDQQSHLEFRVNRNNLELSTPLKIETISHEDLQR
QLAVLDKAMKAKVATYLGGLPDVPFSATPVNAFYNGCMEVNINGVQLDLDEAISKHNDIRAHSCPSVWKKTKNS

We modeled these sequences using all the three modeling methods: homology modeling, thread/fold recognition, and ab-initio method. We then compared the results of our models

For threading and ab-initio modeling, we submitted both sequences on the I-TASSER and Robetta Baker Lab’s modeling servers and obtained our results via email.


Homology Modeling

Protein Sequence: The protein sequence used for homology modeling is based on the protein model found on the PDB for Protein S. The PDB file used is 1Z6C, and the protein sequence for this model is as follows: KDVDECSLKPSICGTAVCKNIPGDFECECPEGYRYNLKSKSCEDIDECSENMCAQLCVNYPGGYTCYCDGKKGFKLAQDQKSCEVVS

This sequence starts at amino acid 200 and goes up to amino acid 286. The Protein Database (PDB) only has an 87 amino acid long sequence for Protein S. For homology modeling, we wanted to make sure we modeled a real protein and not a hypothetical version generated through a software. Therefore, we only could model the small number of residues.

The 1Z6C file was modeled by using the ChimeraX software. The amino acids that are known to have mutations that cause Protein S deficiency were highlighted in the color cyan by using the Action tool to color amino acids. The amino acids in the 676 amino acid sequence are: GLU 204, LYS 208, GLY 222, CYS 226, GLU 230, ARG 233, TYR 234, LYS 237, CYS 241, ASP 243, ASP 245, CYS 247, GLU 249, CYS 256, ASN 258, GLY 262, CYS 265, TYR 266, CYS 267, and GLY 269. The side-chains of these amino acids were shown as well using the show atoms/bonds command. Once the the chains were displayed, the hydrogen bonds (dashes colored blue) could be viewed between the sidechains.

This first model highlights the amino acids that are known to undergo mutations that cause Protein S deficiency in cyan. 430 hydrogen bonds were found.

In the second model, the amino acids colored in cyan were mutated and were then colored pink. The swapaa command was used to mutate every amino acid in the protein. The mutations brought about changes in the side-chains which changed interactions and hydrogen bonds between the sidechains. These properties can be further studied to determine how specific mutations can cause protein S deficiency.

This second model highlights the mutated amino acids in pink along with their changed sidechains. 726 hydrogen bonds were found after the mutations had been made.

The amino acid mutations that cause protein S deficiency are described as follows:

  1. ASP 245, which has an electrically charged acidic side-chain, was mutated to GLY 245 which has a nonpolar side-chain.
  2. CYS 247, which has a polar side-chain, was mutated to GLY 247 which has a nonpolar side-chain.
  3. GLU 249, which has an electrically charged acidic side-chain, was mutated to LYS 249 which has an electrically charged basic side-chain.
  4. CYS 256, which has a polar side-chain was mutated to SER 256 which has a polar side-chain.
  5. ASN 258, which has a polar sidechain, was mutated to SER 258 which has a polar side-chain.
  6. CYS 265, which has a polar sidechain, was mutated to TRP 265 which has a hydrophobic side-chain.
  7. TYR 266, which has a hydrophobic side-chain, was mutated to CYS 266 which has a polar side-chain.
  8. CYS 267, which has a polar sidechain, was mutated to SER 267 which has a polar sidechain.
  9. GLY 269, which has a nonpolar side-chain, was mutated to ARG 269 which has an electrically charged basic side-chain.

A few of the mutations from the above list are known to produce certain effects. The mutation from LYS 208, which has an electrically charged basic side-chain, to ASN 208, which has a polar side-chain, is known to cause a significantly higher risk of thrombophilia due to genetically inherited protein S deficiency.

The mutation which is shown in the image on the right results in an additional 6 hydrogen bonds because the number of total hydrogen bonds found in the model after this mutation is 436.

The mutation from GLY 222, which has a nonpolar side-chain, to ARG 222 which has an electrically charged basic side-chain, is known to cause a significantly higher risk of thrombophilia due to genetically inherited protein S deficiency.

The mutation which is shown in the image on the right results in an additional 40 hydrogen bonds because the number of total hydrogen bonds found in the model after this mutation is 470.

The mutation from CYS 226, which has a polar side-chain, to TYR 226 which has a hydrophobic side-chain, is known to cause a significantly higher risk of thrombophilia due to genetically inherited protein S deficiency.

The mutation which is shown in the image on the right results in an additional 21 hydrogen bonds because the number of total hydrogen bonds found in the model after this mutation is 451.

The mutation from GLU 230, which has an electrically charged acidic side-chain, to LYS 230 which has an electrically charged basic side-chain, is known to cause a significantly higher risk of thrombophilia due to genetically inherited protein S deficiency.

The mutation which is shown in the image on the right results in an additional 13 hydrogen bonds because the number of total hydrogen bonds found in the model after this mutation is 443.

The mutation from ARG 233, which has an electrically charged basic side-chain, was mutated to LYS 233 which has an electrically charged basic side-chain, and is known to potentially cause thrombophilia due to protein S deficiency. However, this mutation does not affect the level of Protein S, and the protein is still normally secreted. Instead, it affects protein S activity level, and is known to cause type 2 protein S deficiency, which is significantly harder to diagnose.

The mutation which is shown in the image on the right results in an additional 6 hydrogen bonds because the number of total hydrogen bonds found in the model after this mutation is 436.

The mutation from TYR 234, which has a hydrophobic side-chain, to CYS 234 which has a polar side-chain,is known to cause a significantly higher risk of thrombophilia due to genetically inherited protein S deficiency.

The mutation which is shown in the image on the right results in an additional 29 hydrogen bonds because the number of total hydrogen bonds found in the model after this mutation is 459.

The mutation from LYS 237, which has an electrically charged basic side-chain, to THR 237 which has a polar side-chain,is known to cause a significantly higher risk of thrombophilia due to genetically inherited protein S deficiency.

The mutation which is shown in the image on the right results in an additional 6 hydrogen bonds because the number of total hydrogen bonds found in the model after this mutation is 436.

The mutation from CYS 256, which has a polar side-chain to SER 256 which has a polar side-chain, is known to cause a significantly higher risk of thrombophilia due to genetically inherited protein S deficiency.

The mutation which is shown in the image on the right results in an additional 60 hydrogen bonds because the number of total hydrogen bonds found in the model after this mutation is 490.

The mutation from ASN 258, which has a polar sidechain, to SER 258 which has a polar side-chain, is known to cause a significantly higher risk of thrombophilia due to genetically inherited protein S deficiency. Due to this mutation, only 30% of the normal level of protein S is produced. Furthermore, the secretion of protein S is impaired; any protein that is not secreted is degraded inside cells.

The mutation which is shown in the image on the right results in an additional 1 hydrogen bond because the number of total hydrogen bonds found in the model after this mutation is 431.

The mutation from GLY 269, which has a nonpolar side-chain, to ARG 269 which has an electrically charged basic side-chain is known to cause a significantly higher risk of thrombophilia due to genetically inherited protein S deficiency.

The mutation which is shown in the image on the right results in an additional 17 hydrogen bonds because the number of total hydrogen bonds found in the model after this mutation is 447.

These changes can be further studied to determine how specific mutations can cause protein S deficiency. This is because an increase in the number of hydrogen bonds indicates that there can potentially be a change in the shape and conformation of the protein. The increase in hydrogen bonds can produce a more compact and inflexible protein. This can also strongly affect the protein’s function since a protein’s ability to function depends on its structure.

These changes can be further studied to determine how specific mutations can cause protein S deficiency. This is because an increase in the number of hydrogen bonds indicates that there can potentially be a change in the shape and conformation of the protein. The increase in hydrogen bonds can produce a more compact and inflexible protein. This can also strongly affect the protein’s function since a protein’s ability to function depends on its structure.


Threading/Fold Recognition Obtained Models

The precursor protein sequence and the mature protein sequence were submitted on the I-Tasser web server. The results were obtained via email. I-TASSER uses a threading approach to predict the top 5 models. For each target, I-TASSER simulations generate tens of thousands conformations (called decoys). To select the final models, I-TASSER uses the SPICKER program to cluster all the decoys based on the pairwise structure similarity, and report up to five models which correspond to the five largest structure clusters.

Additionally, the confidence of each model is quantitatively measured by a C-score. The C-score is a scoring function that is based on the relative clustering structural density, as well as the consensus significance score of multiple threading templates. These are used to estimate the accuracy of the I-TASSER predictions. The C-score is in the range of [-5, 2]. A higher value indicates higher confidence.

I-TASSER cannot typically be used to study the effect of mutations on the overall topology of proteins because the overall topology of I-TASSER models for different mutants of the same protein is usually very similar. A structure model can tell you where the mutation is. However, unless you have introduced a substantial number of mutations, or introduce long insertion/deletions, different mutants of the same protein usually have similar final I-TASSER structures. This is because wild type protein and mutant protein of a small number of substitutions usually share almost identical conformation, even if their function might differ significantly.

676 Precursor Protein Sequence

Predicted Normalized B-factor

B-factor is a value to indicate the extent of the inherent thermal mobility of residues/atoms in proteins. In I-TASSER, this value is deduced from threading template proteins from the PDB in combination with the sequence profiles derived from sequence databases. The reported B-factor profile in the figure below corresponds to the normalized B-factor of the target protein, defined by B=(B'-u)/s, where B' is the raw B-factor value, u and s are respectively the mean and standard deviation of the raw B-factors along the sequence.

C-Score Comparison

 
I-TASSER Model C-Score
Model 1 -2.86
Model 2 -2.57
Model 3 -3.48
Model 4 -4.30
Model 5 -4.30

Obtained Models

Model 1

Model 2

Model 3

Model 4

Model 5

Through C-score comparison, model 2 comes out to be the best model predicted by this method.

TM-Align

I-TASSER also uses the TM-align structural alignment program to match the first I-TASSER model to all structures in the PDB library. This section reports the top 10 proteins from the PDB that have the closest structural similarity to the predicted I-TASSER model. Due to the structural similarity, these proteins often have similar functions to the target. This structure is shown below.

Proteins structurally close to the target in the PDB (as identified by TM-align)

Biological Annotations

I-TASSER also reports biological annotations of the target protein by COFACTOR and COACH based on the I-TASSER structure prediction. While COFACTOR deduces protein functions (ligand-binding sites, EC and GO) using structure comparison and protein-protein networks, COACH is a meta-server approach that combines multiple function annotation results (on ligand-binding sites) from the COFACTOR, TM-SITE and S-SITE programs. This is shown below.

Ligand Binding Sites

635 Protein Sequence

Predicted Normalized B-factor

 
I-TASSER Model C-Score
Model 1 -2.33
Model 2 -3.17
Model 3 -4.06
Model 4 -2.77
Model 5 -3.69

Obtained Models

Model 1

Model 2

Model 3

Model 4

Model 5

Through C-score comparison, model 1 comes out to be the best model predicted by this method.

TM-Align

Proteins structurally close to the target in the PDB (as identified by TM-align)

Biological Annotations

Ligand Binding Sites


Ab-Initio Method Obtained Models

The precursor protein sequence and the mature protein sequence were also submitted on the Robetta Baker Lab web server. Robetta is a protein structure prediction server developed by the Baker lab at the University of Washington. It relies on the Rosetta macromolecular modeling suite developed by the Rosetta Commons. The modeling results were obtained within 24 hours and the 5 best models were predicted.

676 Precursor Protein Sequence

Model 1

Model 2

Model 3

Model 4

Model 5

635 Precursor Protein Sequence

Model 1

Model 2

Model 3

Model 4

Model 5


Comparison Parameters

Root Mean Square Deviation (RMSD): In bioinformatics, the root-mean-square deviation of atomic positions is the measure of the average distance between the atoms (usually the backbone atoms) of superimposed proteins. It is a measure to study how good the alignment of two proteins is. The lower the RMSD, better is the alignment and lesser is the deviation of the backbones of the two proteins. An RMSD below 3 is considered good for alignment. We calculated the RMSD of the best models obtained from threading/fold recognition.

Ramachandran Plot: The Ramachandran plot shows the statistical distribution of the combinations of the backbone dihedral angles ϕ and ψ. It gives information about the energetically allowed and disallowed regions in the protein. Having a lower number of residues in the disallowed regions and a high number of residues in the allowed energy regions indicates a good protein structure. We measured this parameter across all models obtained from the ab-initio method, the two best models obtained through threading/fold recognition, and — the models obtained from homology modeling. The Ramachandran plots have been generated using MolProbity. It is most complete for crystal structure of proteins and acts as an active validation tool that produces coordinates, graphics and numerical evaluations. Higher weightage was given to the Ramachandran Plot values over the RMSD values.

RMSD Comparison for Threading/Fold Recognition Models

 
Model RMSD
Threading/Fold Recognition 676 Model 2 15.2±3.4Å
Threading/Fold Recognition 676 Model 1 13.6±4.0Å

Ramachandran Plots

676 AA Protein Sequence

 
Ab-Initio Model With Full Comprehensive Analysis Linked General Case Analysis
676 AA Sequence Ab-Initio Model 1

As can be seen, 93.5% (630/674) of all residues are in favored (98%) regions.

98.7% (665/674) of all residues are in allowed (>99.8%) regions.

There were 9 outliers.

676 AA Sequence Ab-Initio Model 2

As can be seen, 93.3% (629/674) of all residues are in favored (98%) regions.

98.8% (666/674) of all residues are in allowed (>99.8%) regions.

There were 8 outliers.

676 AA Sequence Ab-Initio Model 3

As can be seen, 92.6% (624/674) of all residues are in favored (98%) regions.

98.7% (665/674) of all residues are in allowed (>99.8%) regions.

There were 9 outliers.

676 AA Sequence Ab-Initio Model 4

As can be seen, 90.2% (608/674) of all residues are in favored (98%) regions.

97.8% (659/674) of all residues are in allowed (>99.8%) regions.

There were 15 outliers.

676 AA Sequence Ab-Initio Model 5

As can be seen, 93.6% (631/674) of all residues are in favored (98%) regions.

99.1% (668/674) of all residues are in allowed (>99.8%) regions.

There were 6 outliers.

 
Threading/Fold Recognition Model With
Full Comprehensive Analysis Linked
General Case Analysis
676 AA Sequence Threading/Fold Recognition Model 2

As can be seen, 56.4% (380/674) of all residues are in favored (98%) regions.

80.7% (544/674) of all residues are in allowed (>99.8%) regions.

There were 130 outliers

635 AA Protein Sequence

 
Ab-Initio Model With Full Comprehensive Analysis Linked General Case Analysis
635 AA Sequence Ab-Initio Model 1

As can be seen, 93.4% (591/633) of all residues are in favored (98%) regions.

99.2% (628/633) of all residues are in allowed (>99.8%) regions.

There were 5 outliers.

635 AA Sequence Ab-Initio Model 2

As can be seen, 94.0% (595/633) of all residues are in favored (98%) regions.

98.7% (625/633) of all residues are in allowed (>99.8%) regions.

There were 8 outliers.

635 AA Sequence Ab-Initio Model 3

As can be seen, 94.0% (595/633) of all residues are in favored (98%) regions.

99.2% (628/633) of all residues are in allowed (>99.8%) regions.

There were 5 outliers.

635 AA Sequence Ab-Initio Model 4

As can be seen, 92.6% (586/633) of all residues are in favored (98%) regions.

98.6% (624/633) of all residues are in allowed (>99.8%) regions.

There were 9 outliers.

635 AA Sequence Ab-Initio Model 5

As can be seen, 93.2% (590/633) of all residues are in favored (98%) regions.

99.4% (629/633) of all residues were in allowed (>99.8%) regions.

There were 4 outliers.

 
Threading/Fold Recognition Model With
Full Comprehensive Analysis Linked
General Case Analysis
635 AA Sequence Threading/Fold Recognition Model 1

As can be seen, 59.4% (376/633) of all residues are in favored (98%) regions.

81.7% (517/633) of all residues are in allowed (>99.8%) regions.

There were 116 outliers.


Homology Modeling

 
Ab-Initio Model With Full Comprehensive Analysis Linked General Case Analysis
Homology Modeling Natural Protein

As can be seen, 64.7% (55/85) of all residues are in favored (98%) regions.

89.4% (76/85) of all residues are in allowed (>99.8%) regions.

There were 9 outliers.

Homology Modeling With Mutations

As can be seen, 94.0% (595/633) of all residues are in favored (98%) regions.

98.7% (625/633) of all residues are in allowed (>99.8%) regions.

There were 8 outliers.

Overall, our Ramachandran plots validated our various different 3-D protein models, with the Ab-Initio models being the most reliable. The Ab-Initio models have higher numbers of residues in the allowed energy regions on the Ramachandran plots, which led us to conclude that Ab-Initio model 3 for the 635 aa protein sequence was potentially the best model for post-translational, functional protein S, with the highest percentage of residues in favored and allowed regions.


Conclusion

Through protein modeling, we were better able to understand the behavior of protein S, allowing us to understand how cells function and how the misfolded protein can cause disease. This data proved useful in allowing us to understand how the protein structure can change with different genetic mutations, and how this contributes to type I, type II, and type III protein S deficiency.

Through the homology protein modeling we were able to see how the mutations changed amino acid side-chains, which changed interactions and hydrogen bonds between the sidechains. In the original model, 430 hydrogen bonds were found. In the mutated model, 726 hydrogen bonds were found.

The increase in hydrogen bonds shows that there is a change in interactions due to amino acid mutations. It is very likely that the change in the protein’s structure brought about by changes in bonding within the protein can affect its function. In particular, the increase in hydrogen bonds can produce a significantly more compact and inflexible protein.Therefore, it can be concluded that another aspect of research includes studying the new interactions brought about by mutations, which can help us understand how specific mutations can cause protein S deficiency.

For example, there are certain mutations that are known to cause this disorder. These mutations and their outcomes can be related to the structural differences seen before and after the mutations take place. For this reason, the mutations that cause certain side effects related to protein S deficiency were zoomed in on in order to study the difference in hydrogen bonds before and after the mutation. Modeling these mutations gave us better insight and understanding into how these types of protein S deficiency differ, and allowed us to explore potential diagnostic methods for each type.

The protein modeling also gave us better insight into possible mutations that may arise when trying to express a recombinant version of the protein, and what the possible variations of our product might be from natural, functional protein S. Overall, generating possible structures provided us with a greater level of understanding of how protein S works, allowing us to create hypotheses about how to affect, control, and modify it. Even in the future, beyond iGEM, knowing the protein’s structure can also allow us to possibly design site-directed mutations with the intent of changing the protein function. We also plan on looking into the interaction between protein S and protein C, its binding partner, and using protein modeling to help predict their activity. This modeling provides an important first step that offers invaluable insight into our protein structure, function, and molecular dynamics.


References

Bordoli, L., Kiefer, F., Arnold, K., Benkert, P., Battey, J., & Schwede, T. (2008). Protein structure homology modeling using Swiss-Model Workspace. Nature Protocols, 4(1), 1–13.https://doi.org/10.1038/nprot.2008.197

Drakenberg, T. (2005, May 25). Solution Structure of the Ca2+-Binding EGF3−4 Pair from Vitamin K-Dependent Protein S: Identification of an Unusual Fold in EGF3†,‡. ACS Publications.https://pubs.acs.org/doi/10.1021/bi050101f

HHpred (Söding lab). Söding J. (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics 21, 951-960. doi:10.1093/bioinformatics/bti125.

Map align (Ovchinnikov lab). S Ovchinnikov et al. Protein Structure Determination using Metagenome sequence data. (2017) Science. 355(6322):294–8.

Minkyung Baek, Frank DiMaio, Ivan Anishchenko, Justas Dauparas, Sergey Ovchinnikov, Gyu Rie Lee, Jue Wang, Qian Cong, Lisa N. Kinch, R. Dustin Schaeffer, Claudia Millán, Hahnbeom Park, Carson Adams, Caleb R. Glassman, Andy DeGiovanni, Jose H. Pereira, Andria V. Rodrigues, Alberdina A. van Dijk, Ana C. Ebrecht, Diederik J. Opperman, Theo Sagmeister, Christoph Buhlheller, Tea PavkovKeller, Manoj K Rathinaswamy, Udit Dalwadi, Calvin K Yip, John E Burke, K. Christopher Garcia, Nick V. Grishin, Paul D. Adams, Randy J. Read, David Baker. Accurate prediction of protein structures and interactions using a 3-track network. Science 10.1126/science.abj8754 (2021); doi:https://doi.org/10.1126/science.abj8754.

Naozumi Hiranuma, Hahnbeom Park, Ivan Anishchanka, Minkyung Baek, David Baker. Improved protein structure refinement guided by deep learning based accuracy estimation. (2020) bioRxiv preprint. https://doi.org/10.1101/2020.07.17.209643.

National Library of Medicine. (2022, May). vitamin K-dependent protein S isoform 2 preproprotein [Homo sapiens] - Protein. NCBI.https://www.ncbi.nlm.nih.gov/protein/192447438

Protein Data Bank. (2022, Mar 2). 1Z6C. RCSB PDB.https://www.rcsb.org/structure/1Z6C

RaptorX (Xu lab). Morten Källberg, Haipeng Wang, Sheng Wang, Jian Peng, Zhiyong Wang, Hui Lu & Jinbo Xu. Template-based protein structure modeling using the RaptorX web server. Nature Protocols 7, 1511-1522, 2012.

Rost, B., Schneider, R., & Sander, C. (1997). Protein fold recognition by prediction-based threading. Journal of Molecular Biology, 270(3), 471–480. https://doi.org/10.1006/jmbi.1997.1101

Protein Data Bank. (2022, Mar 2). 1Z6C. RCSB PDB.https://www.rcsb.org/structure/1Z6C

SigmaAldrich. (2022). Amino Acids Reference Chart. Millipore Sigma - Amino Acids Reference Chart.https://www.sigmaaldrich.com/US/en/technical-documents/technical-article/protein-biology/protein-structural-analysis/amino-acid-reference-chart

Sparks-X (Zhou lab). Yuedong Yang, Eshel Faraggi, Huiying Zhao, Yaoqi Zhou. Improving protein fold recognition and template-based modeling by employing probabilistic-based matching between predicted one-dimensional structural properties of the query and corresponding native properties of templates. Bioinformatics 27:2076-82(2011)

Srivatsan Raman, Robert Vernon, James Thompson, Michael Tyka, Ruslan Sadreyev, Jimin Pei, David Kim, Elizabeth Kellogg, Frank DiMaio, Oliver Lange, Lisa Kinch, Will Sheffler, Bong-Hyun Kim, Rhiju Das, Nick V. Grishin, and David Baker. Structure prediction for CASP8 with all-atom refinement using Rosetta. (2009) Proteins 77 Suppl 9:89-99.

TrRosetta and DeepAccNet. Jianyi Yang, Ivan Anishchenko, Hahnbeom Park, Zhenling Peng, Sergey Ovchinnikov, and David Baker. Improved protein structure prediction using predicted interresidue orientations. (2020) PNAS 117 (3) 1496-1503.

UniProtKB - P07225 (PROS_HUMAN). (n.d.). Retrieved June 15, 2022, fromhttps://www.uniprot.org/uniprot/P07225

UniProt. (2022). UniProt. UniProt - P07225 · PROS_HUMAN.https://www.uniprot.org/uniprotkb/P07225/entry

U.S. National Library of Medicine. (n.d.). CCDS report for consensus cds. National Center for Biotechnology Information. Retrieved June 15, 2022, fromhttps://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi?REQUEST=CCDS&GO=MainBrowse&DATA=CCDS2923.1

Yifan Song, Frank DiMaio, Ray Yu-Ruei Wang, David Kim, Chris Miles, TJ Brunette, James Thompson and David Baker. High resolution comparative modeling with RosettaCM. Structure. 2013 Oct 8;21(10):1735-42.

Yuan, X., Shao, Y., & Bystroff, C. (2003). Ab initio protein structure prediction using pathway models. Comparative and Functional Genomics, 4(4), 397–401.https://doi.org/10.1002/cfg.305

Zhang, Y. (2008). I-tasser server for protein 3D structure prediction. BMC Bioinformatics, 9(1).https://doi.org/10.1186/1471-2105-9-40