CPU_Nanjing - iGEM 2022

Model

Introduction


    Predicting phosphorus production is essential for applying our project in practice. In this model, we found that the bacterial concentration was directly proportional to the OD value, with 1 OD corresponding to approximately 1×10^9 cells/ml. We then used the Logistic growth model to predict the bacterial population size. Because the Logistic growth model is idealized and deviates from reality, we added a regularization term to it, which significantly improved its prediction accuracy. Next, we established a differential equation and linked it with the Logistic growth model to predict the phosphorus yield. The phosphorus yield in the experimental data fluctuates, however, and this model cannot describe such volatility, so we converted the data into time series data. We applied machine learning to the transformed data to supplement the experimental data and to further verify the previous model. Moreover, the experimental data show that the bacteria retain the ability to fix carbon, so we built a model that predicts the amount of fixed carbon from the experimental data. By combining the above models, we obtained a comprehensive model for our project.

Module of increment of microorganism


    In nature, a population commonly grows when it enters a new ecosystem. Two typical growth patterns are distinguished:

    (1) J-shaped growth: the population is not limited by factors such as food and space, so the environment is ideal for its growth. In this case, the growth function is exponential, and its graph is shaped like a J.

    (2) S-shaped growth: most environments are not ideal for the population. In this case, the graph of the growth function is S-shaped and approximately follows the logistic growth model, which is often used for population prediction.

    However, J-shaped growth exists only in an ideal environment, which does not occur in nature, so we do not use the exponential model to predict our population growth.

    The logistic growth model is widely used to predict the growth of various populations because of its simplicity and interpretability. Moreover, the Logistic growth model divides the growth process of a population into five periods: the beginning period, the acceleration period, the turning period, the deceleration period, and the saturation period. This process is highly consistent with real bacterial growth, so we chose the Logistic growth model [1].

    We define the population size of the microorganism in the particular vessel as the function below:
picture

(1)

which means that, in a given environment, the population size of the microorganism depends only on time. The growth velocity of the microorganism is then:
picture

(2)


    The Logistic growth model can properly predict the population size of the microorganism. The differential form of the Logistic model is:
picture

(3)


where K denotes the maximum population size and r denotes the growth rate constant [2], which represents the maximum per-capita growth rate. The initial condition of the model is: picture

(4)


    Combining equations (3) and (4), we obtain the equation below: picture

(5)
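Equations (1) to (5) are not reproduced in this text. Under the standard form of the logistic model they would read as sketched below. The symbol N_0 for the initial population size is an assumed name, and the notation in the original images may differ. With this form, the growth velocity G peaks at rK/4, when N = K/2.

```latex
\begin{align}
N &= N(t) \tag{1} \\
G &= \frac{dN}{dt} \tag{2} \\
\frac{dN}{dt} &= rN\left(1 - \frac{N}{K}\right) \tag{3} \\
N(0) &= N_0 \tag{4} \\
N(t) &= \frac{K N_0 e^{rt}}{K + N_0\left(e^{rt} - 1\right)} \tag{5}
\end{align}
```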


    Through equation (5), we fit the experimental growth curve of the microorganism used in our project in the particular environment. Once the initial population size of the microorganism is known, it is simple to predict its size at any given time t.
    To fit the curve, the most significant parameter in the equation is the rate constant r, because the maximum size constant K and the initial population size can be read directly from the data. The least squares method fits such curves well; its principle is to make the sum of squared differences between the fitted function and the data points as small as possible. For a data set of n points [x1,...,xn], we adjust the parameters of a function f(x,ω), where ω represents the set of unknown parameters, to minimize the expression below:
picture

(6)


    The least squares method is typically carried out iteratively. More details can be seen in our code.
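As a minimal sketch of such an iterative least-squares fit, the example below recovers the rate constant r of a logistic curve when K and the initial size are taken as known. It is not the team's code: the function names, the golden-section search and the synthetic, noise-free data are all assumptions made for illustration.

```python
import math

def logistic(t, K, N0, r):
    """Closed-form logistic curve with maximum size K and initial size N0."""
    growth = math.exp(r * t)
    return K * N0 * growth / (K + N0 * (growth - 1.0))

def sse(r, times, observed, K, N0):
    """Sum of squared differences between the model and the data."""
    return sum((logistic(t, K, N0, r) - y) ** 2 for t, y in zip(times, observed))

def fit_rate(times, observed, K, N0, lo=0.0, hi=5.0, tol=1e-9):
    """Golden-section search for the r in [lo, hi] that minimizes the SSE."""
    phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - phi * (b - a), a + phi * (b - a)
    while b - a > tol:
        if sse(c, times, observed, K, N0) < sse(d, times, observed, K, N0):
            b = d  # the minimum lies in [a, d]
        else:
            a = c  # the minimum lies in [c, b]
        c, d = b - phi * (b - a), a + phi * (b - a)
    return (a + b) / 2.0

# Synthetic data generated from known parameters (NOT experimental measurements)
K_TRUE, N0_TRUE, R_TRUE = 1.8, 0.05, 0.9
times = [float(h) for h in range(13)]
observed = [logistic(t, K_TRUE, N0_TRUE, R_TRUE) for t in times]

r_fit = fit_rate(times, observed, K_TRUE, N0_TRUE)
print(f"fitted r = {r_fit:.6f}")
```

A one-dimensional search is enough here because only r is unknown. Fitting K and the initial size at the same time would need a multi-parameter optimizer instead.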

    In our experiment, two key microorganisms, MR and MRP, take part in the production of phosphate. We measured the experimental growth curves of these two microorganisms and fitted a logistic curve to each; the model performs well for both.


picture picture

Figure 1. Logistic fits of the growth curves of the microorganisms

    MR:
picture

(7)

    MRP:

(8)

    Based on the combined analysis of the Logistic growth model and the experimental data collected in the lab, we found that the population size of MRP is about 50% larger than that of MR.

Module of phosphate formation


    In our project, we can assume that the total weight of phosphorus is constant; we denote it by P.
    According to chemical structure and distribution, there are three kinds of phosphorus in total: phosphite (P3), phosphate (P5) and polyphosphate (Pn).


picture

Figure 2. Three kinds of phosphorus


    Thus, the distribution of phosphorus can be written as the equation below:
picture

(9)


    The transformation into phosphate and polyphosphate follows the path P3 → P5 → Pn. Thus we have the kinetic equation below:
picture

(10)


    In our experiment, the two microorganisms, MR and MRP, differ in their ability to transform phosphorus. MR can only transform P3 to P5, while MRP can transform P3 to P5 and P5 to Pn. In the MR system, equation (10) becomes:
picture

(11)


    In the MRP system, we use the steady-state approximation that the total content of P5 is constant (almost equal to 0).
    Thus equation (10) becomes:
picture

(12)
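Equations (9) to (12) are not reproduced in this text. The relations below are a sketch of what follows from the stated assumptions alone (constant total phosphorus, no polyphosphate formation in MR, steady-state P5 in MRP). They are not necessarily the exact form used in the original images, which may also include explicit rate constants.

```latex
\begin{align}
P &= P_3 + P_5 + P_n \\
\frac{dP_3}{dt} + \frac{dP_5}{dt} + \frac{dP_n}{dt} &= 0 \\
\text{MR}\ \left(\tfrac{dP_n}{dt} = 0\right):\quad \frac{dP_5}{dt} &= -\frac{dP_3}{dt} \\
\text{MRP}\ \left(\tfrac{dP_5}{dt} \approx 0\right):\quad \frac{dP_n}{dt} &= -\frac{dP_3}{dt}
\end{align}
```

In both systems the product formed can therefore be tracked through the decrease of P3 alone.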


    According to equations (11) and (12), we can calculate and predict the percentage of conversion from the change in P3. Before we find the correct curve to describe the transformation of phosphorus, we still need to discuss the distribution of the different forms of phosphorus. Microorganisms absorb phosphite (P3) and convert it into phosphate (P5) and polyphosphate (Pn) inside their cells. Therefore, we can determine the amounts of the different forms of phosphorus by measuring the total phosphorus concentration in the solution and in the cells. Assume that the volume of the solution is V and that the concentration of phosphorus in the solution is Cp
picture

(13)


and the total weight of the microorganism is m(t), which is linearly related to the population size of the microorganism N(t). The mass fraction of phosphorus in the microorganism is ω.
picture

(14)


where k is a correction factor. The logistic fitting curves of the two microorganisms are known from section 2.

    MR:
picture

(15)

    MRP:
picture

(16)


    For MR, combining equations (11), (13), (14) and (15) gives:


picture

(17)


picture

(18)


    For MRP, the equation is obtained in a similar way. Because the two functions c(t) and ωMR(t) are both unknown, we still need to fit one of these curves [3]. We first try a logistic function again:


picture

(19)


    Before deriving a formula, we label the parameters in the equations above. For the logistic function fitted to N(t), K is denoted K1, P is denoted P1 and r is denoted r1. For the logistic function fitted to ω, K is denoted K2, P is denoted P2 and r is denoted r2. Then, from equations (7), (18) and (19), we can infer that:


picture

(20)


where C is the integration constant, which can be determined from the initial condition.

    From the boundary conditions, we know that


picture

(21)


picture

(22)


    Thus we can derive the unknown parameters k and C.

    Besides, if we want to fit the concentration directly with this form of function, instead of fitting ω(5) with a logistic function, we can simplify its parameters:


picture

(23)


picture picture

Figure 3. Concentration of phosphorus in solution


    For the production of phosphate, we solved for the limit of our prediction model and included actual data in the calculation. We conclude that the phosphate production of MRP is approximately 5 times that of MR. Taking into account MRP's advantage in population size (about 50% larger, as found in the growth module), we estimate that MRP produces 7-8 times as much phosphate as MR.

Calculation of COD


    In this section, we only discuss the MRP microorganism.
When the whole system is balanced, we describe the consumption of phosphorus with the equations below:


picture

(24)


picture

(25)


where Pm is the rate of phosphorus production per unit population size of the microorganism, Pa is the rate of phosphorus consumption per unit population size of algae (Pm and Pa are independent of time), and Na(t) is the population size of algae.
    Thus we have

picture

(26)


    For COD, the total productivity is:


picture

(27)


    Combining the equations above, we obtain the COD productivity equation for the balanced system.


picture

(28)


where NMRP (t) needs to be fitted to the real population growth curve of the microorganism with a logistic function, and k is the proportionality coefficient between NMRP (t) and mMRP(t), which can be calculated through equations (26) and (28).


picture

Figure 4. Calculated curve of COD vs. time (h)


    Phosphate manufactured by our engineered bacteria at a concentration of 0.4 OD (about 4×10^8 cells per mL, taking 1 OD ≈ 1×10^9 cells/ml) could support the growth of more algae (one more fold of COD) than is needed for bacterial growth itself after each cycle [4].

Machine learning model


    According to the experimental data curve, the phosphorus production fluctuates between 7 hours and 12 hours. The model above cannot describe this fluctuation, so we convert the data into time series data and use machine learning to complement the experimental data and to verify the model mentioned above.
picture

    Ensemble learning can improve the accuracy of a model. Generally speaking, an ensemble learning model makes full use of the known data, is less sensitive to noise and generalizes better. We use bootstrap aggregating, also called bagging, to predict our experimental data. In this kind of training, there is no strong dependence between the base models, so they can be trained independently. Bootstrap aggregating resembles collective decision making: each base model learns separately, on a different resample of the data, which lets the ensemble learn more from the data. Because the base models differ, their results may differ as well, so each model makes its own prediction and the ensemble uses the average of these predictions as the final prediction. The main advantage of bootstrap aggregating is that it exploits the independence between the base models: averaging their predictions significantly reduces the error.

    The three machine learning models selected here are GBDT, SVR (support vector regression) and Random Forest, which are widely used in regression tasks.

    GBDT and Random Forest are both ensemble models built from decision trees: GBDT is based on boosting, whereas Random Forest is based on bagging. GBDT builds decision trees iteratively. During training, it first initializes a single decision tree whose predictive ability is weak. It then calculates the gradient of the loss function, fits a new tree to that gradient, and repeats this step many times. Finally, it combines the base models into a gradient-boosted tree ensemble for the final prediction. The Random Forest method constructs a large number of decision trees and combines their predictions to make the final prediction. As a result, its prediction accuracy is typically greater than that of a single decision tree.
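The boosting loop can be sketched in a few lines. The example below is an illustrative toy, not the implementation used in the project: it boosts one-split regression trees (stumps) under a squared loss, for which the negative gradient is simply the residual. All names, the learning rate and the hourly values are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump: (threshold, left mean, right mean)."""
    best = None
    order = sorted(set(xs))
    for lo, hi in zip(order, order[1:]):
        thr = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, thr, lm, rm)
    return best[1:]

def predict(model, x):
    """Base value plus the shrunken sum of all stump outputs."""
    base, rate, stumps = model
    return base + rate * sum(lm if x <= thr else rm for thr, lm, rm in stumps)

def fit_gbdt(xs, ys, n_rounds=200, rate=0.1):
    """Gradient boosting with stumps under a squared loss."""
    base = sum(ys) / len(ys)
    stumps = []
    for _ in range(n_rounds):
        model = (base, rate, stumps)
        # For a squared loss, the negative gradient is the residual
        residuals = [y - predict(model, x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return (base, rate, stumps)

def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical hourly values (NOT experimental measurements)
hours = [float(h) for h in range(1, 13)]
values = [0.12, 0.27, 0.54, 0.92, 1.30, 1.55, 1.69, 1.75, 1.78, 1.79, 1.80, 1.80]

baseline = (sum(values) / len(values), 0.1, [])  # mean-only model, no trees
boosted = fit_gbdt(hours, values)
print(mse(baseline, hours, values), mse(boosted, hours, values))
```

Each round corrects only a fraction (`rate`) of the remaining error, which is why many weak trees are needed and why the training error shrinks gradually.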

    The SVR algorithm is widely used in machine learning because of its excellent prediction performance. SVR maps the data from a low-dimensional space, where they are not linearly separable, into a high-dimensional space, where they are. It then builds a hyperplane that minimizes the distance from all data points to the hyperplane, which is similar to a linear regression model.

    The objective function for finding this hyperplane is:


picture

    subject to:

picture picture picture picture
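The objective and constraints are not reproduced in this text. For reference, the standard ε-insensitive SVR formulation is sketched below; the project's exact variant may differ. Here φ is the feature map, C is the penalty parameter, ε is the width of the insensitive tube, and ξ_i, ξ_i^* are slack variables.

```latex
\begin{align}
\min_{w,\,b,\,\xi,\,\xi^*}\quad & \frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\left(\xi_i + \xi_i^*\right) \\
\text{subject to}\quad & y_i - w^\top\varphi(x_i) - b \le \varepsilon + \xi_i, \\
& w^\top\varphi(x_i) + b - y_i \le \varepsilon + \xi_i^*, \\
& \xi_i \ge 0, \\
& \xi_i^* \ge 0, \qquad i = 1,\dots,n.
\end{align}
```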

    Finally, we train the three models on the experimental data of the first 9 hours only and use them to predict the data of the last 3 hours.
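The split into 9 training hours and 3 predicted hours, combined with bagging, can be sketched as follows. This is a simplified stand-in, not the project's pipeline: the base models are plain least-squares lines on a lag-1 time series instead of trees or SVR, and the hourly values, function names and seed are hypothetical.

```python
import random

def to_supervised(series):
    """Turn a series into (previous value, next value) training pairs."""
    return list(zip(series[:-1], series[1:]))

def fit_line(pairs):
    """Ordinary least-squares line y = a + b*x through the given pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    # Fall back to a constant model if the resample has no spread in x
    b = sum((x - mx) * (y - my) for x, y in pairs) / sxx if sxx > 0 else 0.0
    return my - b * mx, b

def bagged_forecast(series, steps, n_models=50, seed=0):
    """Train base models on bootstrap resamples and average their forecasts."""
    rng = random.Random(seed)
    pairs = to_supervised(series)
    models = [fit_line([rng.choice(pairs) for _ in pairs]) for _ in range(n_models)]
    history, out = list(series), []
    for _ in range(steps):
        preds = [a + b * history[-1] for a, b in models]
        nxt = sum(preds) / len(preds)  # simple average over the base models
        out.append(nxt)
        history.append(nxt)  # feed the forecast back in for the next step
    return out

# Hypothetical hourly values (NOT experimental measurements)
hourly = [0.12, 0.27, 0.54, 0.92, 1.30, 1.55, 1.69, 1.75, 1.78, 1.79, 1.80, 1.80]
train, held_out = hourly[:9], hourly[9:]

forecast = bagged_forecast(train, steps=len(held_out))
for hour, (pred, actual) in enumerate(zip(forecast, held_out), start=10):
    print(f"hour {hour}: predicted {pred:.3f}, held-out value {actual:.3f}")
```

Comparing the forecast with the held-out hours gives a direct check of how well a model trained on the early data extrapolates.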


picture picture

Figure 5. The result of the machine learning model


    The predictions of all three machine learning models are relatively accurate, but the SVR model shows larger errors than the other two, so we use the other two models to supplement the experimental data.


picture picture

Figure 6. The additional data


References

[1] J.H. Matis, T.R. Kiffe, W. Van der Werf, A.C. Costamagna, T.I. Matis, W.E. Grant, Population dynamics models based on cumulative density dependent feedback: A link to the logistic growth curve and a test for symmetry using aphid data, Ecological Modelling 220(15) (2009) 1745-1751.
[2] A.A. Blumberg, Logistic growth rate functions, Journal of Theoretical Biology 21(1) (1968) 42-44.
[3] A.R. Sepaskhah, S. Fahandezh-Saadi, S. Zand-Parsa, Logistic model application for prediction of maize yield under water and nitrogen management, Agricultural Water Management 99(1) (2011) 51-57.
[4] S. Pommier, D. Chenu, M. Quintard, X. Lefebvre, A logistic model for the prediction of the influence of water on the solid waste methanization in landfills, Biotechnology and Bioengineering 97(3) (2007) 473-482.