Modeling our Project


Model 1: Likelihood of Expression in Cyanobacteria

model2.knit

In the future, we want to experiment with expressing our polymers of choice directly in the cyanobacteria. This would enable us to spend less time and resources spent on transforming and cultivating Pichia pastoris.

A significant limitation of this approach is that we wouldn’t be able to lyse the polymer-producing cells as they would also be required for the subsequent phases of calcium carbonate precipitation. Instead, the produced proteins would need to be directly released into the supernatant.

Therefore, we decided to create a machine learning model that can predict the likelihood of a protein being secreted by cyanobacteria into the supernatant. This model can serve as a starting point for selecting polymers to express in cyanobacteria.

It remains difficult to predict secretion signals of proteins and internal protein sequences may also affect protein secretion. Therefore, we aim to build a model that classifies proteins into 2 categories - secreted and intracellular based on its amino acid composition.

To start, we created 2 datasets for our machine learning model to be trained on. These datasets were built upon a literature search for recombinant proteins produced by Synechocystis. We did not specify the strain due to the limited data available.

Dataset 1 consists of proteins that are expressed and secreted by cyanobacteria into the supernatant.

Dataset 2 consists of the negative samples - proteins that are not secreted by cyanobacteria. However, due to the lack of published negative results, we also included proteins that were not efficiently expressed by cyanobacteria, as well as proteins that were secreted at extremely low levels.

The main drawback of our model is the limited amount of data - especially for dataset 1 as Synechocystis hasn’t frequently been used to express recombinant proteins. We hope this model serves as a starting point, that can easily be modified to include new protein sequences. We also hope to extend this dataset ourselves as we continue working with cyanobacteria.

The NCBI database was used to download all the proteins’ amino acid sequences in a FASTA file format.

# load FASTA files onto R 

dataset1 <- readFASTA("dataset1.fasta")
dataset2 <- readFASTA("dataset2.fasta")
# The protcheck function ensure that the sequences only contain the standard 20 amino acids, needed for further steps.

dataset1 <- dataset1[(sapply(dataset1, protcheck))]
dataset2 <- dataset2[(sapply(dataset2, protcheck))]

The protr package allows us to create numerical representations of amino acid sequences, based on various descriptors. Examples of descriptors include amino acid composition, dipeptide composition, pseudo amino acid composition, amphiphilic pseudo amino acid composition.

We are looking at specifically the pseudo amino acid composition of our proteins - which numerically represents the sequence of amino acids, its properties, while retaining the sequence order information.

# Determine Pseudo Amino Acid Composition for each protein

x1 <- t(sapply(dataset1, extractPAAC))
x2 <- t(sapply(dataset2, extractPAAC))

#bind both datasets together
x <- rbind(x1, x2)
#add classification labels (0 - dataset1, secreted proteins & 1 - dataset2, intracellular proteins)
labels <- as.factor(c(rep(0, length(dataset1)), rep(1, length(dataset2))))

Next we split the dataset into a training and test set. The training set comprises of 75% of the data and is the data the model uses to learn from. The remaining 25% of the data is used to test its predictions.

set.seed(1001)

# split training and test set
tr.idx <- c(
  sample(1:nrow(x1), round(nrow(x1) * 0.75)),
  sample(nrow(x1) + 1:nrow(x2), round(nrow(x2) * 0.75))
)
te.idx <- setdiff(1:nrow(x), tr.idx)

x.train <- x[tr.idx, ]
x.test <- x[te.idx, ]

y.train <- labels[tr.idx]
y.test <- labels[te.idx]

Cross-validation of the dataset 5 times allows us to use the entire dataset for training and testing. The dataset is split into 5 - 4 parts are used as training and 1 for testing. This is repeated 5 times, such that each part is used for testing once and the remaining for training.

Random forests is an ensemble learning method that can be used for classification. This approach uses many “trees” - classification questions based on calculated features to determine which class a protein belongs to (secreted or intracellular). The class with the most votes ends up being the final result.

rf.fit <- randomForest(x.train, y.train, cv.fold = 5)
rf.fit
## 
## Call:
##  randomForest(x = x.train, y = y.train, cv.fold = 5) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 14.47%
## Confusion matrix:
##    0  1 class.error
## 0 20  6   0.2307692
## 1  5 45   0.1000000
# predict on test set
rf.pred <- predict(rf.fit, newdata = x.test, type = "prob")[, 1]

A ROC curve is a graphical plot that can be used to measure the performance of a classification model. It’s created by plotting the specificity (false positive rate) against the sensitivity (true positive rate). In an ideal model, the specificity remains low & sensitivity is high - meaning the curve rises rapidly and the area under the curve remains high. The area under the curve is used as a guide for the accuracy of the model - a higher number denotes greater accuracy.

# plot ROC curve
plot.roc(y.test, rf.pred, grid = TRUE, print.auc = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls > cases

Finally, we use our model to predict whether our protein of choice is likely to be secreted or not.

Thus, our model can be used to predict the likelihood of the target protein being secreted. Furthermore, the amino acid sequence can be modified to see if that results in a change in the likelihood of being secreted.

#1 - Gelatin

testset <- readFASTA("testset.fasta")
testset1 <- testset[(sapply(testset, protcheck))]

testset2 <- t(sapply(testset1, extractPAAC))

pred1 <- predict(rf.fit, newdata = testset2 , type = "prob")[, 1]

pred1
##                PHB         SpiderSilk  Collagen(partial) Bacterial-Collagen 
##              0.650              0.650              0.518              0.596 
##           Collagen 
##              0.528

Our results show that none of our current polymers are likely to be secreted.


Model 2: Upscaling of Pichitecture

Our Idea

As we, in our project, want to offer alternatives to current large-scale building material manufacturing, the idea of this model was, naturally, to simulate a possible upscaling of the Pichitecture idea.

Assumptions and Data

The composition of this potential pichitecture brick is based on the paper by Chelsea (2020), from which we calculated the necessary amount of gelatine and CaCO3 for 1 tone Pichitecture (Table 1) [1]. All datapoints were taken from Chelsea 2020.

Mwet = 1000kg * rwet/ rdry

Table 1. Chelsea 2020
rwet 1800 kg (m3)-1
rdry 1600 kg (m3)-1
Mwet 1125 kg

The Pichitecture building material is made of 70% Sand and 30% medium. For the model we concentrate on the gelatine and CaCO3 as the main ingredients.

Composition of a assumed Pichitecture Brick
Visualization of Pichitecture Brick Components

For 1000kg dry Pichitecture Brick 337.5 L medium are needed. For the medium 33.75 kg gelatine and 3.38 kg CaCO3 are needed.

Gelatin

Gelatin is not an easy recombinant protein and is hard to produce in a single cell organism. This model and the wet lab work is largely based on the paper from Gellermann et al. (2019)[2]. They successfully expressed a non-hydoxylated gelatin mimetic. With the given data and some assumptions based on Looser et al. (2015) [3], a theoretical fermentation process was modelled for a continuous stirred tank reactor (CSTR) with ~ 900L working volume. The starting culture is not modelled. Assumed yield for gelatin was 3g/l.

Table 2. Gellermann et al. (2019) and Looser et al. (2015)
Batch Phase
Glycerol (S) 120 g/l Gellermann
YieldGlycerol (X/S) 0.45 g/g Looser
mmax Glycerol 0.29 h-1 Looser
Fed Phase
Fed Phase (MeOH) 0.02 % Gellermann
Fermentation End 80 h Gellermann
mmax MeOH 0.002 h-1 Looser
YieldMeOH (X/S) 0.008 g/g Assumption based on Looser
Yield(P/X) 1.9 mg/g*h Assumption based on Gellermann
Fermentation Gellermann 2019
Fermentation acc. to Gellermann et al. (2019)

The limiting factor in the recombinant gelatine production is the biomass at the beginning of the MeOH Fedphase. If we can reuse the Pichia pastoris for more than one fed phase the process is more efficient.

For the 33.75kg gelatin, we need a supernatant volume of 11.250 m3 with a titer of 3 g/L. It would need over 6m3 pure MeOH and a Fed-Batch fermentation would take over 1000 h in total process time [2,3].

Calcium Carbonate

The CaCO3 precipitates in one or both reactions:

Beside this two reactions the solubility of calcium carbonate need to be exceeded. This is calculated by the supersaturation of the solution. The activity of Ca2+ and CO3- and solubility product can be seen used here:

WCaCO3 = aCa2+*aCO2-3/KSCaCO3

The oceans are supersaturated with Ca2+ and also HCO3-, but the spontaneous precipitation is hindered by various kinetic effects.

The inorganic carbon concentration mechanism (CCM) of cyanobacteria and microalgae leads to an extracellular space where calcium carbonate can precipitate. The CCM enriches the carbon dioxide level in the cell 1000-fold to the surrounding medium. This high number of CO2 is necessary for the first enzymatic step of the calvin basham benson cycle. The alkalization of the S-Layer of the cell and the activity of the extracellular carbonic anhydrase (CA) leads to the precipitation. Yang et al 2016 have shown, with a deletion of the extracellular CA, the crucial roll of this enzyme in the precipitation [4,5].

Jansson et al 2010
From Jansson et al 2010 [6]: Representation of the CCM

The model for the needed calcium carbonate for one ton of Pichitecture is based on this equation from Lamerand 2022:

DCaprecipitated(mmol) = 2.0 * DBMproduced(gwet)

From this we determine that the needed mass of CaCO3 is 3.375 kg/t.

MCaCO3 = 100.07 g/mol

3375 g/ 100.09 g*mol-1 = 33.72 mol

(DCaprecipitated(mol)* 103)/2.0 = DBMproduced(gwet)

33.72 mol* 103)/2.0 = DBMproduced(gwet)

Conclusion from this model: 16.9 kg biomass are needed to produce 3.375 kg CaCO3.

Carbon Capture

The carbon flux and carbon balance for the Pichia pastoris was not calculated. Therefore, we are not able to calculate the fixated CO2 in gelatine. But we are able to make an assumption of captured CO2 for the needed MeOH.

One last assumption has to be made: the methanol is synthesised by a innovative methode out of H2, captured CO2 and solar energy [7].

CO2(g) + 3H2(g) « CH3OH(l) + H2O(g)

6000 L * 0.735 kg/L = 4410 kg

4410 kg / 32.04 kmol*kg-1 = 137,7 kmol (MeOH)

137.7 kmol * 44,01 kg/kmol (CO2) = 6057 kg

Conclusion: 6057 kg CO2 would be captured for the gelatine production.

For the CaCO3 production, the single carbon source is gaseous CO2.

The ratio from wet to dry biomass is factor 10.

16.9 kgwet ® 1.7 kgdry

According to Lamérad (2022) the dry BM consist to 50% of organic carbon (0,85 kg). The CaCO3 with 3,375 kg consist also from captured CO2.

0,85kg / 12,01 kmol/kg = 70,2 mol

Captured CO2: 70,2 mol + 33,75 mol = 104 mol

0,104 kmol * 44,01 kg/kmol = 4,6 kg

Conclusion: all in all, we would capture over 6000kg from the atmosphere during production.


References