Science — without the risk of a lab leak
Here we have used mathematical modelling to justify and direct decisions we have made throughout our project. We have used ODE growth models, phase space analysis, stochastic simulations, Markov chain analysis, and a number of statistical analysis techniques to help guide our project.
We were able to quantify how much better V. natriegens can be over E. coli as a chassis organism. We fine-tuned our mutation mechanism, and quantitatively compared its simulated long run diversity to naturally occurring genes found in the BLAST database. We also justified using dilution in our bioreactor.
Here we show that V. natriegens gives lower fixation times (where the size of the smaller (initially) fitter sub-population overtakes the rest of the population) than E. coli. The exponential model shows that fixation time, \(\tau \propto \frac{1}{r_{A}\phi},\) decreases with growth rate, \(r_{A},\) and fitness factor, \(\phi\). Stochastic simulation results show fixation times \(\tau_{E.coli}=0.84 (0.69,0.98),\) and \(\tau_{V.nat}=0.56 (0.48,0.68)\) in arbitrary units of time. This suggests using V. natriegens could give fixation times \(33\%\) lower than E. coli.
Take an initial colony of size \(N_0=A_0+B_0\), where \(B_0=\frac{1}{N_0}\) represents a fitter sub-population. Population \(B\) is fitter by a factor \(\phi > 1\) so the derivatives are: \(\dot{A}=r_{A}A,\) and \(\dot{B}=\phi r_{A}B\). This model assumes a bacterial colony follows exponential growth, so is only appropriate when describing the log phase of growth.
We want to find the time, \(\tau\), where \(A(\tau) = B(\tau)\), where the fitter sub-population \(B\) overtakes population \(A\). Integration by separation of variables shows \(A(t)=A_{0}\exp{(r_{A}t)}=\frac{N_{0}-1}{N_{0}}\exp{(r_{A}t)}\), and \(B(t)=\frac{1}{N_{0}}\exp{(\phi r_{A}t)}\).
At time of overtake \(\tau\), we have \(A(\tau)=B(\tau)\), which rearranges and simplifies to \(\tau=\frac{\ln{(N_{0}-1)}}{r_{A}(\phi - 1)}\)
This tells us that \(\tau \propto \frac{1}{r_{A}\phi}\). This provides justification for using V. natriegens for its lower doubling time, as we will see fixation of fitter mutants faster. Furthermore, the lower doubling time can compensate for a lower fitness factor, \(\phi\), if we want to maximise for properties like dynamic range in our selection mechanism instead of raw fitness.
To make our model more realistic, we include a carrying capacity, \(K\), so our model now looks like \(\dot{A}=r_{A}A(\frac{1-N}{K})\), and \(\dot{B}=\phi r_{A}B(\frac{1-N}{K})\), where \(N=A+B\). This is known as coupled logistic growth, but we cannot solve for \(A\) and \(B\) analytically. The inclusion of a carrying capacity, now models a growth rate that slows as the population increases. This can now better represent the log phase as it moves into the stationary phase.
Equilibrium points are points where the system is not changing, and are found at \(\frac{dA}{dt}=\frac{dB}{dt}=0\). These \((A,B)\) points occur at \(e_{1}=(0,0), e_{2}=(0,1)\) and \(e_{3}=(1,0)\). The feasible region (where quantities take biologically sensible values) of this phase space is \([0,1]^2\), and \(B=1\) represents total fixation on the fitter variant. We can visualise the vector field and trajectories of this system in a phase portrait (Figure 1).
The problem here is that fixation does not occur for any \(A_{0} > B_{0}\). This is because of the line of equilibria, \(B=1-A\). This means that once a trajectory hits this line, it stops changing. This tells us that this deterministic model is too rigid to describe our system well. To move past this, we now analyse the stochastic analogue of this model, which has more flexibility than the deterministic ODE model, as it shouldn't settle on this line.
Here, colony growth is simulated using the Gillespie algorithm, with the following events and rates. +1 A: \(\alpha A(1-\frac{N}{K})\), -1 A: \(\gamma A\), +1 B: \(\beta B(1-\frac{N}{K})\), and -1 B: \(\delta B\). We take \(\gamma=\delta\), and \(\beta=\phi \alpha\), where \(\phi\) is the fitness factor. We take \(\alpha=2.1\), \(\gamma=2.0\), \(K=1000\), \(\phi=5.0\) and initial condition \(A_{0}=500\), \(B_{0}=10\). This initial condition represents a relatively small fitter sub-population at the start of the simulation. To implement V. natriegens's faster growth rate, we take \(\alpha_{vib}=\frac{20}{7} \alpha \approx 2.86 \alpha\). This is a realistic choice of growth rate difference, as V. natriegens has a doubling time close to \(7\) minutes and E. coli has a doubling time close to \(20\) minutes. We compare both simulations using the time taken to reach \(B>A\) as our primary metric of interest (representing fixation time). Simulations are repeated \(1000\) times and distributions of fixation times are analysed.
Simulated fixation time for E. coli is found to be \(\tau_{E.coli}=0.84 (0.69,0.98)\). Figure 2 shows the distribution of fixation times for E. coli.
Simulated fixation time for V. natriegens was found to be \(\tau_{V.nat}=0.56 (0.48,0.68)\). Figure 3 shows the distribution of fixation times for V. natriegens.
Results show that V. natriegens has fixation times \(33\%\) lower than E. coli.
Growth simulation also finds that bioreactor dilution can further accelerate fixation of beneficial genes. This dilution would look like an inflow of fresh media washing out a proportion of the existing culture. Including dilution capabilities (which we will have in our bioreactor), results show fixation times of \(\tau_{E.coli}=0.66 (0.47,0.85),\) and \(\tau_{V.nat}=0.26 (0.19,0.33).\) This suggests using V. natriegens could give fixation times \(61\%\) lower than E. coli with \(50\%\) dilutions when colony size reaches half of carrying capacity. The marginal effect of dilution on V. natriegens (at these parameter values) is a decrease in fixation time of \(54\%\). Here fixation time is defined as it was in the stochastic growth model (the time at which \(B > A\)).
We implemented this in the model by including the statement "if \(N > 0.5K\) dilute culture to half concentration" directly into the stochastic model used before. This helped to avoid the growth rate decreasing effect of approaching carrying capacity.
Fixation time for E. coli with dilution was found to be \(\tau_{E.coli}=0.66 (0.47,0.85)\). Figure 4 shows the distribution of these times.
Fixation time for V. natriegens with dilution was found to be \(\tau_{V.nat}=0.26 (0.19,0.33)\). Figure 5 shows the distribution of these times. Compare this to Figure 3 to see the beneficial effect of dilution on fixation time.
Results show that diluting to half concentration at \(0.5K\) decreases fixation time for V. natriegens by \(54\%\) compared with not diluting at all.
Here we begin to build a stochastic simulation model of our MutaT7 mutation mechanism, to answer some of our project's key questions. Our two mutation enzymes, cytosine deaminase (CD) and adenine deaminase (AD), are only capable of very specific mutations. CD is capable of mutating a cytosine (C) to a thymine (T), with a \(0.1\) probability of hitting the antisense strand instead. This off-strand mutation can be read as a mutation of a guanine (G) to an adenine (A) on the sense strand. AD is capable of A to G mutations, again with a \(0.1\) off-strand mutation probability, resulting in a T to C sense strand mutation. Precise mutation rates are not yet known, but simulations were run for sufficient time to reach long run states.
The state space of each nucleotide is {A, C, G, T}. This means the size of the sequence space for a gene of length \(k\) is \(|S_{k}| = 4^{k}.\) Since we can only mutate within (but not between) {A, G} (purines) and {C, T} (pyrimidines), the size of reachable sequence space for a gene of length k is \(|\Sigma_{k, X_{0}}| = 2^{k},\) where \(X_{0}\) is the initial DNA sequence. From these numbers, we deduce that the proportion, \(\rho\), of reachable sequences from an initial sequence \(X_{0}\) is \(\rho_{k} = \frac{|\Sigma_{k, X_{0}}|}{|S_{k}|} = 2 ^ {-k}\) for a sequence of length k. This appears to be problematic for the objective of this project, since the Neisseria meningitidis lactose producing NmLgtB gene is \(828\) nucleotides long, which gives \(\rho_{828}\approx5.6 \times 10^{-250}\). This looks negligibly small, so further modelling is required to verify the potential of the project.
In this section we consider the stationary distribution of a purine in a sequence. This will give us the long run probability of a purine being an A or a G. The same maths applies to pyrimidines. Given a generator matrix \(G=\) \begin{pmatrix} -\alpha & \alpha\\ \beta & -\beta \end{pmatrix} we find the stationary distributions by solving \(\boldsymbol{\hat{\pi}}G=\boldsymbol{0}\) for \(\boldsymbol{\hat{\pi}}.\) The matrix can be \(2\times 2\) here because a purine space cannot be cytosine or thymine. In general, \(\boldsymbol{\hat{\pi}}=(\frac{\beta}{\alpha + \beta},\frac{\alpha}{\alpha + \beta}).\)
When only using cytosine deaminase, we have \(\alpha=0, \beta=0.1\). This results in a stationary distribution of \(\boldsymbol{\hat{\pi}}=(1,0)\), which is a problem because if all purines become adenines in the long run, our system will remove all diversity.
When using both cytosine deaminase and adenine deaminase, we have \(\alpha=0.5, \beta=0.5\). This results in a stationary distribution of \(\boldsymbol{\hat{\pi}}=(0.5,0.5)\), which is much more useful to us. Therefore, we use both deaminase enzymes in our project.
One benefit of analysing state spaces, is that we make very few assumptions about our system. For this reason, we can have more certainty that findings here are true, where other findings come with confidence intervals and are subject to more assumptions.
In this section we continue analysing our MutaT7 mutation mechanism using Markov Chain Monte Carlo simulations to determine that two T7 promoters gives substantially better diversity than just one T7 promoter. In fact we find the two T7 system has \(130\%\) more diversity (by the length normalised Levenshtein distance metric) than the single T7 system. Here we assume that point mutation events are independant and identically distributed, and that time between any two point mutation events is exponentially distributed with a rate parameter that is the sum of the individual mutation rates.
The Levenshtein distance between two character sequences is the minimum number of single-character edits (insertions, deletions or substitutions) to change one word into the other. For our purposes we measure this distance between two sequences, and divide by the length of the shorter sequence. The result, the length normalised Levenshtein distance (LNLD), is a metric that describes distance between sequences, and can be sensibly applied to seequences of different lengths. This is by no means the only way to characterise distance between sequences, but as it is a fairly simple and general measure, we have decided that it is appropriate. This is in line with the principle of Occam's razor, that simple models are usually better, that statisticians are trained to use.
The only difference we need in the model to compare the two systems, is to change the generator matrix to represent the different point mutation rates of each system. The generator for the single T7 system is \(G_{Single} = \) \begin{pmatrix} -0.9 & 0 & 0.9 & 0\\ 0 & -0.9 & 0 & 0.9\\ 0.1 & 0 & -0.1 & 0\\ 0 & 0.1 & 0 & -0.1 \end{pmatrix} and the generator for the system with both is \(G_{Both} = \) \begin{pmatrix} -0.5 & 0 & 0.5 & 0\\ 0 & -0.5 & 0 & 0.5\\ 0.5 & 0 & -0.5 & 0\\ 0 & 0.5 & 0 & -0.5 \end{pmatrix}.
For the simulation, we started with the NmLgtB gene and let each simulation run until they reached a long run stable state. Originally we let them run for \(24\) hours simulation time (about \(1\) minute \(30\) seconds clock time) but we found that the simulations stabilised after just \(5\) hours simulation time (about \(19\) seconds clock time). This finding made the simulations much more efficient and we could simulate \(1\) mutant sequence every \(19\) seconds.
This simulation was run \(1000\) times (taking over \(5\) hours each!) giving us \(1000\) mutant sequences. To estimate the ditribution of LNLD diversity, without having to calculate all \(498501\) distances between unique pairs of points, we used an unbiased bootstrap estimator using only \(10000\) distances. The estimator takes the arithmetic mean of\(k=10000\) distances between random pairs of mutant sequences. Since the expected distance between two random mutant sequence is the same as the arithmetic mean of all the mutant sequences, the estimator is unbiased.
Nucleotide diversity for the single T7 promoter system is LNLD\(=0.18 (0.15,0.21)\). Figure 7 shows the distribution of these distances.
Nucleotide diversity for the two T7 promoter system is LNLD\(=0.46 (0.43,0.48)\). Figure 8 shows the distributions of these distances.
We also consider amino acid diversity, by translating DNA sequences codon by codon into their corresponding amino acid sequence. Amino acid diversity for the single T7 promoter system is LNLD\(=0.32 (0.27,0.38)\). Figure 9 shows the distributions of these distances.
Amino acid diversity for the two T7 promoter system is LNLD\(=0.72 (0.67,0.77)\). Figure 10 shows the distributions of these distances.
In conclusion, we see much better diversity when using the second T7 promoter. This will allow our mechanism to explore more of sequence space and hopefully find more beneficial mutations.
We have now shown how much diversity our two T7 promoter system can induce - but how does this compare to the diversity found in nature? In this section we compare diversity between our mechanism and nature. We sourced the naturally occurring gene sequences from the BLAST database. We find our mechanism to induce \(67\) percentage points more LNLD diversity than naturally occurring genes. We again used bootstrap estimates of inter-sequence diversity for the BLAST sequences and focused on the amino acid sequences.
To select the BLAST sequences in an unbiased way, we searched the whole protein database for sequences that closely resembled the NmLgtB or contained subsequences similar to NmLgtB. Click here for the exact search file. One assumption this makes is that the only genes that have similar function to NmLgtB also contain a subsequence that is similar to NmLgtB. This is a bit of a stretch, but it is acceptable for our purposes, because our mutation system is centred around NmLgtB, so we care more about diversity around NmLgtB than anywhere else. The lowest similarity score within the BLAST sequences was about \(30\%\) identity, which shows there is a wide range of genes that resembling NmLgtB that can produce lactose.
For equal length BLAST sequences, we find LNLD\(=0.05 (-0.08,0.18)\). Figure 11 shows the distributions of these distances.
For unequal length BLAST sequences (within \(10\) amino acids of each other), we find LNLD\(=0.18 (-0.28,0.64)\). Figure 12 shows the distributions of these distances.
We can see here that for the equal length BLAST sequences, our MutaT7 mutation mechanism is capable of substantially more diversity. This holds on average for unequal length BLAST sequences, but there are exceptions (see the bump above \(0.8\) LNLD in Figure 12) and we see much wider \(95\%\) confidence intervals.
We do see more diversity amongst our mutant sequences than in the BLAST sequences. However, we can reasonably assume that most if not all of the BLAST sequences are viable, whereas our MutaT7 system is likely to cause many mutations that cause the organism to die. So even though we see so much more diversity from our MutaT7 system than in nature, we cannot guarantee that all of it is useful diversity. In the future, we could be able to estimate the proportion of diversity from our MutaT7 system that is useful after we generate some experimental results.
Another way we can think about diversity is by looking at the proportion of each amino acid we see in mutant sequences. Here we see boxplots representing proportion distributions of each amino acid for our two T7 promoter system, and then for the BLAST sequences.
We can see in Figure 13, that both sets of boxplots show a consistently high proportion of leucine, and a consistently low proportion of tryptophan. We see many more extreme values (circles) in the BLAST sequences than in our system, suggesting nature has higher kurtosis distributions of amino acid proportions than our system. This could mean that even though we expect to see more diversity on average, nature could be better at finding extremely different sequences.
Using mathematical modelling, we have provided justification for using V. natriegens over E. coli, and the use of dilution in our bioreactor. We have confirmed that both deaminase enzymes are necessary and both T7 promoters are useful in our system. We have discovered that our system is cabable of mutating into a more diverse space than nature appears to, and that our mutants have somewhat similar proportions of amino acids as are found in nature.