Synthetic biology has the potential to solve many different global problems, and implementing fieldable synthetic biology systems can maximize this potential. To design fieldable constructs, synthetic biology must overcome several barriers. The first barrier to fieldability is the difficulty of selecting the right chassis for a given environment, because many factors affect bacterial survival. To solve this problem, our team has developed a software program, chassEASE, to assist researchers with selecting the optimal chassis for their deployment environment.
Our toolkit, chassEASE, consists of several smaller Python packages and tools that work together to determine the optimal chassis for a given environment through predictions of relative abundance and growth accumulation. We developed predictive model packages incorporating multivariate linear regressions, neural network models, a direct sample searching algorithm, and Genome-Scale Metabolic Models (GEMs). When environmental conditions are supplied to our model packages, each package predicts the performance of hundreds of candidate chassis. In addition, we have created a web-based interface for users who are less experienced with coding. This interface takes the same environmental inputs and produces the same predictive outputs, in the form of relative abundances and growth rates, as our toolkit. However, the interface limits the number of predictive systems it can consider and restricts the mechanisms for interactivity and data delivery, which decreases accuracy and speed in many situations.
To design our program, we (1) used metagenomic data analysis to extract bacterial data from an existing data source, (2) developed prediction models using various techniques such as linear regression and machine learning, (3) designed GEMs to model the metabolic pathways in a given bacterium, and (4) created a gut microbiome database to help find the optimal bacterial chassis in the human gut. In addition to designing our software program, we also (5) designed two genetic circuits which detect the transition of bacteria into stationary phase.
Methods
1. Metagenomic data analysis
To make predictions for relative abundance of different target bacterial chassis, chassEASE relies on both user-provided and generated information. First, we collected variables of environments based on the metadata parameters that a sample might have. For instance, soil samples may have a depth specification, and water samples may have a pH specification.
To make accurate predictions for each of these environment specifications, we processed large amounts of 16S metagenomic information from databases, individual studies, and fieldwork. These data sources exist in a raw form, and are assigned to different environments through either automatic assignment methods or manual filtering. In its original form, this data is computationally expensive to process because of extraneous information, is incomplete for many samples because of untracked metadata, and cannot be used directly because of inconsistent or unknown units and scales.
To resolve these problems, we process raw metagenomic data sources into files which we refer to as metagenomic subsets. These subsets are built using environment specifications, raw data sources, and lists of chassis to target, in order to reduce datasets to only the accurate, relevant information needed by the software. By default, our output lists the 500 most abundant genera and the 500 most abundant species.
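The subset-building step described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the chassEASE implementation: the function name `build_subset`, the record layout (`metadata` and `counts` keys), and the sample values are all assumptions made for the example. It drops samples with missing metadata, converts read counts to relative abundances, and keeps only the most abundant taxa.

```python
from collections import defaultdict


def build_subset(samples, required_metadata, top_n=500):
    """Hypothetical helper: filter samples and keep the top_n most abundant taxa."""
    totals = defaultdict(float)
    normalized = []
    for sample in samples:
        # Skip samples that are missing any required metadata field.
        if any(sample["metadata"].get(key) is None for key in required_metadata):
            continue
        depth = sum(sample["counts"].values())
        if depth == 0:
            continue
        # Convert raw read counts to relative abundances so samples are comparable.
        rel = {taxon: n / depth for taxon, n in sample["counts"].items()}
        normalized.append({"metadata": sample["metadata"], "abundance": rel})
        for taxon, value in rel.items():
            totals[taxon] += value

    # Rank taxa by summed relative abundance (ties broken alphabetically).
    ranked = sorted(totals, key=lambda t: (-totals[t], t))
    kept = ranked[:top_n]

    # Trim every retained sample down to the kept taxa only.
    trimmed = [
        {
            "metadata": s["metadata"],
            "abundance": {t: s["abundance"].get(t, 0.0) for t in kept},
        }
        for s in normalized
    ]
    return {"taxa": kept, "samples": trimmed}


# Made-up soil samples for illustration only.
samples = [
    {"metadata": {"pH": 6.5, "depth_cm": 10},
     "counts": {"Bacillus": 60, "Pseudomonas": 30, "Streptomyces": 10}},
    {"metadata": {"pH": None, "depth_cm": 5},
     "counts": {"Bacillus": 5, "Pseudomonas": 95}},
    {"metadata": {"pH": 7.0, "depth_cm": 20},
     "counts": {"Bacillus": 20, "Pseudomonas": 70, "Streptomyces": 10}},
]
subset = build_subset(samples, ["pH", "depth_cm"], top_n=2)
print(subset["taxa"])
```

Because every downstream model reads the same trimmed structure, the cleaning logic lives in one place rather than being repeated per model.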
Once metagenomic subsets are created, they are provided to the next stage of our software: predictive model training. As data has already been processed for consistency, predictive models can interpret the metadata and relative abundance values directly, eliminating fragility and fragmentation concerns surrounding independent processing for each model.
2. Modeling for relative abundance
To find the optimal chassis, it is necessary to know where a specific chassis is able to survive. A large amount of 16S data is currently available on the relative abundance of different bacterial species in specific environments. Using this data, we can build prediction models that estimate where specific bacterial species will survive. Prediction models help us recognize possible patterns between bacterial reads and different environmental conditions. For example, is the amount of E. coli present in an environment related to the amount of nitrogen in that environment? A prediction model can help us answer this question.
The first step in developing these prediction models is to collect our data. We use GIS data as our main data source. We collected more than 30,000 samples with 84 predictors (x values) and 3 dependent variables (y values). To gain more information about our dataset, we used Principal Components Analysis to determine the relationship between our variables. We also used Linear Discriminant Analysis to model the linear classifiers and assess the ability of each independent variable to predict the relative abundances of bacteria.
To determine which models our team should implement in our software, we went through a model selection process. This workflow consists of examining a dataset and brainstorming potential models that can be used to fit the data. When we examined our dataset, we found that a large number of vectors affect our prediction of relative abundance. Because of this, we decided to include a supervised learning technique as one of the approaches in our software. Supervised learning models usually contain complex algorithms that can recognize complicated patterns in a dataset.
To find the optimal models for our software, we tested neural networks, linear regression, KNN regression, and Random Forest regression. Before testing, we removed variables that, according to our PCA results, were not statistically significant. Next, we used different error metrics to evaluate the performance of each model: Binary Cross-entropy for neural networks, and R-squared and Mean Squared Error for KNN, linear regression, and Random Forest. According to our results, all of the methods performed well, and the neural network was the most accurate model. In response, we made the neural network the primary model in our software, and we also incorporated all of the other models.
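As a rough illustration of the regression metrics named above, the sketch below fits a one-predictor least-squares line and computes Mean Squared Error and R-squared by hand. It is a simplification under stated assumptions: the real dataset has 84 predictors, the nitrogen and abundance values here are invented, and the function names are hypothetical. For brevity the metrics are computed on the training points rather than on held-out data.

```python
def fit_line(x, y):
    """Ordinary least squares for a single predictor: y ~ slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Fraction of variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Invented values: nitrogen level vs. relative abundance of one taxon.
nitrogen = [1.0, 2.0, 3.0, 4.0]
abundance = [0.12, 0.18, 0.33, 0.37]

slope, intercept = fit_line(nitrogen, abundance)
predicted = [slope * x + intercept for x in nitrogen]
mse = mean_squared_error(abundance, predicted)
r2 = r_squared(abundance, predicted)
print(f"slope={slope:.3f} intercept={intercept:.3f} MSE={mse:.6f} R2={r2:.3f}")
```

A lower MSE and an R-squared closer to 1 indicate a better fit, which is how models of this kind can be compared against one another on the same data.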
3. Modeling for growth rate
Genome-scale metabolic models (GEMs) are stoichiometric models that incorporate genes, enzymes, reactions, and gene-protein-reaction rules to model the metabolic pathways in a given bacterium. GEMs are one of the primary tools for contextualizing big data to describe bacterial metabolism more accurately. They are traditionally used in therapeutics, where they can model how a drug that inhibits the production of a certain metabolite will affect cell metabolism. For one E. coli model, iML1515, testing has shown 93.4% accuracy for gene essentiality under suboptimal media conditions (Gu et al., 2019). In the context of fieldable synthetic biology, engineered cells will experience suboptimal exposure to metabolites, which makes GEMs well suited for modeling. Additionally, GEMs are dynamic models, with reactions and metabolites constantly being updated in published databases such as BiGG Models. Adding and deleting reactions is common and straightforward, which makes GEMs a welcoming interface for synthetic biologists who wish to incorporate reactions and gene-protein-reaction rules into their chassis of choice. However, due to their novelty, their use in synthetic biology has been minimal.
The first way we used GEMs was to predict biomass accumulation in given media conditions. One of GEMs’ largest strengths is their ability to simulate metabolic processes with limited metabolite concentrations. In our results, we assumed that the ability to properly metabolize in certain conditions would correlate to a higher abundance and ability to survive in the environment of choice.
In our software, a ranked output comes from a Flux Balance Analysis. The Flux Balance Analysis or FBA is the mathematical method that utilizes GEMs to produce an expected growth accumulation rate (Orth et al., 2010). It uses a set of reactions to represent the necessary functions of metabolism and maps them to the set of present metabolites. Through the cobra toolbox, we upload our desired GEM model and set metabolite concentrations. Then, using the FBA tool, we optimize for the objective function, which, in this case, is the rate at which metabolic compounds are converted into biomass components such as nucleic acids, proteins, and lipids (Orth et al., 2010).
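For reference, the optimization that Flux Balance Analysis solves can be written as a linear program, following the standard formulation described by Orth et al. (2010). The symbol names below are a notational choice made here: S is the stoichiometric matrix (metabolites by reactions), v is the vector of reaction fluxes, c selects the biomass reaction as the objective, and l and u are lower and upper flux bounds.

```latex
\begin{aligned}
\max_{v}\quad & c^{\top} v \\
\text{subject to}\quad & S\,v = 0, \\
& l_j \le v_j \le u_j \quad \text{for } j = 1, \dots, n
\end{aligned}
```

The constraint S v = 0 enforces steady state, meaning every metabolite is produced as fast as it is consumed. Media conditions enter through the bounds: limiting the availability of a metabolite corresponds to tightening the bounds on its exchange reaction, and the optimal value of the objective is the predicted biomass accumulation rate.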
The second way we used GEMs in our project was incorporating the Flux Balance Analysis technique above but for a modified GEM. We worked closely with GastonDay-Shangde iGEM who were engineering E. coli to convert L-phenylalanine to cinnamaldehyde. We added their reactions to the GEM model using the cobratoolbox and then conducted a Flux Balance Analysis to see how the newly added reactions would affect the cell growth (Cobrapy Core Team, 2019). We also added the ability of GastonDay-Shangde iGEM to control their media conditions, meaning they will be able to simulate growth in various soil or agar conditions. This was a collaboration and a proof of concept that the various strengths of GEMs can be combined to model synthetic biology in fieldable conditions. To access this full code, please visit our partnership page.
4. Gut Microbiome Database
The gut microbiome database serves as a search tool that allows the user to determine the ideal chassis across different individuals' gut microbiomes. It enables the user to set four parameters (age, country, BMI, and sex) to determine which chassis will survive in the widest range of gut microbiomes. For example, if a researcher wants to develop a therapeutic for female children aged 5-10, they can set the input parameters to age: 5-10 and sex: female. The software will then report the bacterial species and genera that are most commonly found across individuals in these categories, regardless of other factors such as country of origin. It provides two separate lists: the ten most dominant genera and the ten most dominant species.
Our input data was collected from studies conducted around the world. To develop such a search tool, an inclusive and diverse database is crucial: it must contain samples from different countries and age groups. We used a data scraper written in Python to collect metadata from 353 projects, containing 71,642 runs (samples) in total, from the GMrepo database. Each sample has its country of origin, age, sex, and BMI. We imported the dataset using Python, along with other packages (pandas, numpy, etc.) essential for data processing.
We used Python to develop a searching algorithm, which produces the subset of the dataset that meets the user's input. The algorithm uses very basic pandas functions and if/else statements to select rows. For example, a user might specify a male from Japan, aged 17 to 60, with a BMI in the range from 30 to 40.
The searching algorithm narrows down the dataset based on the input parameters. This narrowed-down dataset is then fed into the ranking algorithm, which extracts the species and genus data and counts the number of appearances across the individuals that meet these parameters. The program returns the top 10 most dominant species and genera.
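The search-then-rank flow described above can be sketched as follows. This is a minimal illustration that uses plain Python lists and `collections.Counter` in place of the pandas operations the tool itself uses; the function names `search` and `rank`, the record fields, and the sample records are all assumptions made for this example, not GMrepo data.

```python
from collections import Counter


def search(records, sex=None, country=None, age=None, bmi=None):
    """Return records matching every supplied filter; None means 'any value'."""
    def in_range(value, bounds):
        if bounds is None:
            return True
        low, high = bounds
        return value is not None and low <= value <= high

    return [
        r for r in records
        if (sex is None or r["sex"] == sex)
        and (country is None or r["country"] == country)
        and in_range(r["age"], age)
        and in_range(r["bmi"], bmi)
    ]


def rank(records, level, top_n=10):
    """Count how many matching samples contain each taxon, most frequent first."""
    counts = Counter()
    for r in records:
        counts.update(set(r[level]))  # count each taxon once per sample
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Invented records for illustration only.
records = [
    {"sex": "male", "country": "Japan", "age": 34, "bmi": 31.2,
     "genus": ["Bacteroides", "Bifidobacterium"],
     "species": ["Bacteroides fragilis", "Bifidobacterium longum"]},
    {"sex": "male", "country": "Japan", "age": 52, "bmi": 35.0,
     "genus": ["Bacteroides", "Prevotella"],
     "species": ["Bacteroides fragilis", "Prevotella copri"]},
    {"sex": "female", "country": "Japan", "age": 40, "bmi": 33.0,
     "genus": ["Faecalibacterium"],
     "species": ["Faecalibacterium prausnitzii"]},
    {"sex": "male", "country": "Japan", "age": 70, "bmi": 32.0,
     "genus": ["Bacteroides"],
     "species": ["Bacteroides fragilis"]},
]

# The example query from the text: male, Japan, age 17-60, BMI 30-40.
matches = search(records, sex="male", country="Japan", age=(17, 60), bmi=(30, 40))
top_genera = rank(matches, "genus")
top_species = rank(matches, "species")
print(top_genera)
print(top_species)
```

Leaving a parameter as `None` skips that filter, which mirrors the behavior described earlier where results are reported regardless of factors the user did not specify.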
Improvement of an Existing Part
Overview and Biological Relevance
The ability to assay whether a chassis is actively transcribing its circuit to make proteins is crucial for testing the efficacy of a fieldable construct. Bacteria have two main life states: exponential growth, during which they reproduce and express their circuits with ease, and stationary phase, during which they cease most non-essential metabolic activity. Since stationary phase is induced by unfavorable conditions, such as metabolite shortage, most bacteria in nature exist in stationary phase (Jaishankar & Srivastava, 2017). This is a major problem for fieldable synthetic biology, as constructs that work perfectly in the lab may stop expressing their circuits when introduced into their deployment sites. In order to assay how a circuit will behave in nature, constructs should be tested in the lab while in stationary phase. To this end, William and Mary has designed two stationary phase detection constructs, using red and green fluorescence as outputs. Both constructs use the osmY promoter, which is induced by the cell's entry into stationary phase (Chang et al., 2002). These composite parts are an improvement of MIT iGEM 2006’s composite part BBa_J45995, and are included in the iGEM Registry as parts BBa_K4174001 (red fluorescence) and BBa_K4174002 (green fluorescence), where you can find their full sequences. For more information on our circuit construction process and experimental results, please see our Improve a Part page.
Improved osmY Stationary Phase Detection Constructs
- osmY-mRFP1 Stationary Phase Detection Construct (BBa_K4174001)
We improved this part by [a] offering a new fluorescence option to researchers by replacing GFP with monomeric red fluorescent protein (mRFP1) and [b] making the sequence compatible with a new assembly method (Type IIS).
- Replacing GFP with mRFP1:
- Creating alternative versions of fluorescent bioreporters using proteins with different excitation and emission wavelengths allows researchers to assay multiple parameters at a time, as different fluorescent assays conducted simultaneously must use proteins with different absorption spectra in order for researchers to differentiate between them. We replaced GFP with mRFP1 in order to give researchers a red fluorescent protein for detecting stationary phase in bacteria. mRFP1 is a monomer, matures rapidly, and has minimal spectral overlap with GFP compared to the wild-type red fluorescent protein DsRed (Campbell et al., 2002). However, other DsRed variants have a much higher fluorescence quantum yield and extinction coefficient than mRFP1 (Campbell et al., 2002).
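- Making the construct Type IIS assembly compatible:
- All constructs uploaded to the iGEM Registry must be compatible with either Type IIS assembly or RFC 10 assembly. The original MIT iGEM 2006 construct was only compatible with RFC 10, but after altering the sequence as described above to create our composite part, it is now compatible with both Type IIS assembly and RFC 10 assembly. Increasing the options for assembly methods compatible with our construct will help make its construction more accessible to researchers.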
- osmY-sfGFP Stationary Phase Detection Construct (BBa_K4174002)
We improved this part by [a] increasing fluorescence by replacing GFP with superfolder GFP (sfGFP), [b] removing non-functional scar sequences, and [c] making the sequence compatible with a new assembly method (Type IIS).
- Replacing GFP with sfGFP and changing the RBS:
- Superfolder GFP (sfGFP) is an improved version of GFP which folds more readily and precisely in Escherichia coli, allowing for more efficient, bright, and accurate assays (Pédelacq et al., 2006). We sourced our sfGFP sequence from Ceroni et al. (2015). This sfGFP sequence was designed for high-level expression in E. coli (Ceroni et al., 2015). In order to ensure that our ribosome binding sequence (RBS) was compatible with our coding region, we also replaced the original RBS with an RBS used by Ceroni et al. (2015) with sfGFP.
- Removing non-functional scar sequences:
- The original construct sequence has scar sequences present from assembly. These nonfunctional units were removed by our team as they are purely artifacts of biological assembly.
- Making the construct Type IIS assembly compatible:
- All constructs uploaded to the iGEM Registry must be compatible with either Type IIS assembly or RFC 10 assembly. The original MIT iGEM 2006 construct was only compatible with RFC 10 assembly, but after altering the sequence as described above to create our composite part, it became compatible with both Type IIS assembly and RFC 10 assembly. Increasing the options for assembly methods compatible with our construct will help make its construction more accessible to researchers.
References
Campbell, R. E., Tour, O., Palmer, A. E., Steinbach, P. A., Baird, G. S., Zacharias, D. A., & Tsien, R. Y. (2002). A monomeric red fluorescent protein. Proceedings of the National Academy of Sciences of the United States of America, 99(12), 7877–7882. https://doi.org/10.1073/pnas.082243699
Ceroni, F., Algar, R., Stan, G., & Ellis, T. (2015). Quantifying cellular capacity identifies gene expression designs with reduced burden. Nature Methods, 12(5), 415–418. https://doi.org/10.1038/nmeth.3339
Chang, D. E., Smalley, D. J., & Conway, T. (2002). Gene expression profiling of Escherichia coli growth transitions: an expanded stringent response model. Molecular Microbiology, 45(2), 289–306.
Cobrapy Core Team. (2019). Documentation for COBRApy (cobra 0.25.0 documentation). Retrieved October 11, 2022, from https://cobrapy.readthedocs.io/en/latest/
Gu, C., Kim, G. B., Kim, W. J., Kim, H. U., & Lee, S. Y. (2019). Current status and applications of genome-scale metabolic models. Genome Biology, 20, 121. https://doi.org/10.1186/s13059-019-1730-3
Jaishankar, J., & Srivastava, P. (2017). Molecular basis of stationary phase survival and applications. Frontiers in Microbiology, 8, 2000.
Orth, J. D., Thiele, I., & Palsson, B. Ø. (2010). What is flux balance analysis? Nature Biotechnology, 28(3), 245–248. https://doi.org/10.1038/nbt.1614
Pédelacq, J. D., Cabantous, S., Tran, T., Terwilliger, T. C., & Waldo, G. S. (2006). Engineering and characterization of a superfolder green fluorescent protein. Nature Biotechnology, 24(1), 79–88.