Engineering Success

As part of our project development, our team went through several iterations of the Engineering Design Cycle (design → build → test → learn), when (1) building our Neural Network, (2) developing our regression, and (3) designing our circuits.

We underwent iterations of the Engineering Design Cycle at two different scales. We designed versions of the software package and project as a whole, before building testing and ultimately learning by changing the direction of our software based on computed results, but we also underwent micro-iterations of the Engineering Design Cycle, where each component of the software and models were built iteratively, with knowledge from previous designs informing the next.

At the core of our software and model development process, we created iterative revisions of key algorithms and packages in our software. For each of our predictive models, we designed initial exploratory algorithms, built agile implementations in R and python, evaluated these models by calculating the error between predictions and 16s data in metagenomic subsets, and then created new model revisions based on the information we learned. We repeated this process dozens of times, ultimately creating final versions of each model we use in the software.

To view our software toolkit, please visit the Gitlab: chassEASE.

Neural Network


The Neural Network/Artificial Intelligence predictive model is one of the core algorithms we employed to create accurate relative abundance predictions. We chose to follow a common design for class inference, the fully-dense Artificial Neural Network (ANN) design. As implied by the name, its structure is inspired by the human brain, mimicking the way biological neurons send signals to each other. We designed our initial ANNs consisting of a layer of nodes corresponding to inputs, 6 fully connected hidden layers, and an output layer. Nodes are interconnected with a series of weights and biases. We planned on using the Tensorflow library to build the initial ANN with a subset of predictors.

Design:

We used python and Tensorflow to 6 fully connected hidden layers, and an output layer. We used cross-entropy for our loss function and the Adam function as our optimizer.

Build:

We used python and Tensorflow to 6 fully connected hidden layers, and an output layer. We used cross-entropy for our loss function and the Adam function as our optimizer.

Test:

When training our neural network, we reached an accuracy of 71%. As shown in the graph, the accuracy of our model skyrocketed in the very early stages of our training process. Originally, in our dataset, to account for missing values, we replaced the blank measurements with zero. We suspect that this early jump in accuracy in the early iterations of our model was due to the large number of zero measurements in our initial dataset.

Learn:

We learned that the data plays a crucial role in training an accurate and generalizing model. To avoid the problem with missing data, more data processing needs to be done so that our dataset can be as representative of the real data as possible. We also tested more loss functions and optimizers to see how the results change, and we have found that the best results came from our original loss function (binary cross-entropy) and optimizer (adam).


Regression


Goal

The goal of the regression was to create an interpretable equation that predicts the relative abundance of bacteria in soil.

Iteration 1

Design:

In order to accurately predict the relative abundance of different bacterial genera, we built our initial regression to account for high degree polynomial terms and logarithmic independent variables. We hoped to find statistically significant coefficients that would allow us to compare how one unit change in the explanatory variable impacted the prediction of the dependant variable.

Build:

After our metadata table was imported to R studio we built the following code.


  >reg<-lm(Bacillus~poly(depth,4)+log(depth+.001)+poly(elevation,4)+log(elevation+.001)+poly(soil_temperature,4)
  +poly(ph,4)+log(ph+.001)+poly(respiration,4)+log(respiration+.001)+poly(total_carbon,4)+log(total_carbon+.001)
  +poly(organic_carbon,4)+log(organic_carbon+.001)+poly(total_nitrogen,4)+log(total_nitrogen+.001)+poly(carbon_nitrogen_ratio,4)
  +log(carbon_nitrogen_ratio+.001)+poly(aluminum,4)+log(aluminum+.001)+poly(soil_density,4)+log(soil_density+.001)
  +soil_sand+soil_clay+soil_silt, data=data)
  > summary(reg)

Test:

Learn/Rebuild:

This regression taught us that a large amount of our soil parameters had almost no statistical significance. Additionally, the regression was only able to account for 8% of the variance in prediction. There were a few conclusions we initially made. Either, the soil parameters had very little effect on the relative abundance of our bacterial genera or something was wrong with our data. After running this regression we looked back on the assumptions we were making about normality and decided to build a new regression that used the logarithm of the dependent variable. This adjustment helped to normalize the data and fix issues with heteroskedasticity.

Iteration 2

Design:

As we continued with our regressions we raised the sizes of the polynomial terms to account for higher derivative relationships. We knew that our data was slightly right skewed, so we decided to take the logarithm of our dependent variable. This helped mitigate error caused by lack of normality and homoscedasticity

Build:


  >reg <-lm(Bacillus~poly(depth,4)+log(depth+.001)+poly(elevation,4)+log(elevation+.001)+poly(soil_temperature,4)
  +poly(ph,4)+log(ph+.001)+poly(respiration,4)+log(respiration+.001)+poly(total_carbon,4)+log(total_carbon+.001)
  +poly(organic_carbon,4)+log(organic_carbon+.001)+poly(total_nitrogen,4)+log(total_nitrogen+.001)+poly(carbon_nitrogen_ratio,4)
  +log(carbon_nitrogen_ratio+.001)+poly(aluminum,4)+log(aluminum+.001)+poly(soil_density,4)+log(soil_density+.001)
  +soil_sand+soil_clay+soil_silt, data=data)
  >summary(reg)

Test:

Learn:

From this regression, we started to reduce the number of soil parameters that were consistently statistically relevant. This reduction led to interpretable coefficients in both sign and magnitude. We also realized that a huge source of error in previous regressions was due to the skewing of our data. By taking the logarithm of our dependent variables, we were able to greatly increase the accuracy of our regression. As the final step for the regression, we wanted to add interactive variables, automate the regression, and include species data in addition to genera.

Iteration 3

Design:

We wanted to add interactive variables, species data, and automate the regression simultaneously. We created a python script that would build our regression. The Python code looked at the data spreadsheet and created a list of all of the possible input variables. Then it started to build an R-studio file that would be run in R. The first step was to download the necessary packages into R. Then we started building the regression. We raised each term to the seventh order. In the final regression we declined to use the poly function because it is not functional with stepwise dimensionality reduction. We then added logarithmic mappings for all of the non negative soil parameters. Lastly, we created a list of all possible interactive variables. This was then iterated for the set of all genera and species .

It would take the set of all possible input variables and regresses to the 7th order, if the variable was negative it would be excluded from the logarithm mapping, and would then multiply each variable by each other to get all possible interactive variables. We also added a forward selection dimensionality reduction to the code so it would only output statistically relevant variables.

Build:

      
        import pandas as pd
      import subprocess

      from microbiome_analysis.target.soil_mlr_target import SoilMlrTarget

      target = SoilMlrTarget()
      target.setup()

      params_to_model = target.params_sheet.loc[target.params_sheet["Model"] & (target.params_sheet["Units"] != "enum"), :]["Parameter"]

      output_columns = []
      with open("soil_genus.txt") as f:
          for genus in f.readlines():
              output_columns.append(genus.strip()[:1].upper() + genus.strip()[1:].lower())
      with open("soil_species.txt") as f:
          for species in f.readlines():
              output_columns.append(species.strip()[:1].upper() + species.strip()[1:].lower())

      subset_location = "../../william-and-mary-software/soil_genera_processed.csv"
      subset_for_stats = pd.read_csv(subset_location)

      for bacteria in output_columns:
          if bacteria != "Burkholderia":
              continue

          new_loc = "droppedsheet.csv"
          subset_for_stats.loc[subset_for_stats[bacteria] != 0, :]\
              .dropna(subset=[bacteria]).to_csv(new_loc)

          r_script_base = f"""
          install.packages('pacman');
          install.packages('leaps');
          library(pacman);
          pacman::p_load(leaps,pacman,dplyr,GGally,ggplot2,ggthemes,ggvis,httr,plotly,rio,rmarkdown,shiny,stringr,tidyr,MASS);
          rio_CSV<-import('./{new_loc}');
          """
          terms = []

            for parameter in params_to_model:
                if 'cobalt' in parameter or 'water_holding' in parameter or 'phenol' in parameter:
                    continue

                terms.append(f"{parameter}")
                for j in range(2, 7+1):
                    terms.append(f"{parameter}^{j}")

                if subset_for_stats[parameter].min() >= 0:
                    if "soil_temp" in parameter or "environment_temp" in parameter:
                        continue
                    terms.append(f"log({parameter}+0.0001)")

                for interactive_param in params_to_model:
                    if parameter == interactive_param:
                        continue
                    if 'cobalt' in interactive_param or 'water_holding' in interactive_param or 'phenol' in interactive_param:
                        continue

                    interactive_term = f"{parameter}*{interactive_param}"
                    terms.append(interactive_term)

            r_script = r_script_base + f"""
        reg<-lm({bacteria}+0.0001~{"+".join(terms)},data=rio_CSV);

        final_model <- stepAIC(reg, direction="forward", trace=FALSE);

        summary(final_model);
        """

            #r_script = r_script.replace("\n", "")

            with open("train.r", 'w') as f:
                f.write(r_script)

            print(r_script)
            print(len(r_script))

            process = subprocess.Popen(["wsl", "Rscript", f"./train.r"],
                                   stdout=subprocess.PIPE,
                                   universal_newlines=True)

            while True:
                output = process.stdout.readline()
                try:
                    print(output.strip())
                except:
                    pass
                    # Do something else
        return_code = process.poll()
        if return_code is not None:
            print('RETURN CODE', return_code)
            # Process has finished, read rest of the output
            for output in process.stdout.readlines():
                try:
                    print(output.strip())
                except:
                    pass
            break

    exit()
      
      

Test:

Learn:

This is an output from Burkholderia, a random genera we selected to show the broad swath of our regression. We generated over 500 unique regressions for species and genera data of different accuracies and component variables. In this case, we have an adjusted R-squared of .15, meaning our metadata soil parameters were able to account for only 15% of the variance in Burkholderia relative abundance. This lower accuracy could be from a variety of possible errors, but most likely is due to the non-normality of the data or the incompleteness of our metadata. Additionally, due to the large set of input variables the forward and backward stepwise dimensionality reduction was not able to effectively reduce variables. This means we were unable to get interperatible coefficients when we were dealing with the larger metadata sets. In order to make assumptions about our soil parameters we reduced the number of interactive terms as well as input variables.

Improvement of an Existing Part



Overview and Biological Relevance

The ability to assay whether a chassis is actively transcribing its circuit to make proteins is crucial for testing the efficacy of a fieldable construct. Bacteria have two main life states: exponential growth, during which they reproduce and express their circuits with ease, and stationary phase, during which they cease most non-essential metabolic activity. Since stationary phase is induced by inopportune environments, such as metabolite shortage, most bacteria in nature exist in stationary phase (Jaishankar 2000). This is a major problem for fieldable synthetic biology, as constructs that work perfectly in the lab may stop expressing their circuits when introduced into their deployment sites. In order to assay how a circuit will behave in nature, constructs should be tested in the lab while in stationary phase. To this end, William and Mary has designed two stationary phase detection constructs, using red and green fluorescence as outputs. These composite parts are an improvement of MIT iGEM 2006’s composite part BBa_J45995, and are included in the iGEM Registry as parts BBa_K4174001 (red fluorescence) and BBa_K4174002 (green fluorescence), where you can find their full sequences. Below we have outlined information about [1] the original MIT iGEM 2006 composite part, [2] our improved parts, [3] our protocol for testing our constructs, and [4] results demonstrating our constructs exhibiting stronger fluorescence than the original construct.



Design:


osmY-sfGFP Stationary Phase Detection Construct (BBa_K4174002)

We improved this part by [a] increasing fluorescence by replacing GFP with superfolder GFP (sfGFP), [b] removing non-functional scar sequences, and [c] making the sequence compatible with a new assembly method (Type IIS).

  • Replacing GFP with sfGFP and Changing the RBS
    • Superfolder GFP (sfGFP) is an improved version of GFP which folds more readily and precisely in Escherichia coli, allowing for more efficient, bright, and accurate assays (Pédelacq 2006). We sourced our sfGFP sequence from Ceroni et al. (2015). This sfGFP sequence was designed for high-level expression in E. coli (Ceroni et al. 2015). In order to ensure that our ribosome binding sequence (RBS) was compatible with our coding region, we also replaced the original RBS with an RBS used by Ceroni et al. (2015) with sfGFP.
  • Removing Non-functional Scar Sequences
    • The original construct sequence has scar sequences present from assembly. These nonfunctional units were removed by our team as they are purely artifacts of biological assembly.
  • Making the Construct Type IIS Assembly Compatible
    • All constructs uploaded to the iGEM Registry that are considered for the iGEM competition must be compatible with either Type IIS Assembly or RFC 10 Assembly. The original MIT iGEM 2006 construct was only compatible with RFC 10, but after altering the sequence as described above to create our composite part, it is now compatible with both Type IIS Assembly and RFC 10 Assembly. Increasing the options for assembly methods compatible with our construct will help make its construction more accessible to researchers.

osmY-mRFP1 Stationary Phase Detection Construct (BBa_K4174001)

We improved this part by [a] offering a new fluorescence option to researchers by replacing GFP with monomeric red fluorescence protein (mRFP1) and [b] making the sequence compatible with a new assembly method (Type IIS).

  • Replacing GFP with mRFP1
    • Creating alternative versions of fluorescent bioreporters using proteins with different excitation and emission wavelengths allows researchers to assay multiple parameters at a time, as different fluorescent assays conducted simultaneously must use proteins with different absorption spectra in order for researchers to differentiate between them. We replaced GFP with mRFP1 in order to give researchers a red fluorescence protein for detecting stationary phase in bacteria. mRFP1 is a monomer, has a rapid maturity rate, and has minimal spectral overlap with GFP compared to the wild-type red fluorescent protein DsRed (Campbell et al. 2002). However, other DsRed variants have a much higher fluorescence quantum yield and extinction coefficient than mRFP1 (Campbell et al. 2002).
  • Making the Construct Type IIS Assembly Compatible
    • All constructs uploaded to the iGEM Registry that are considered for the iGEM competition must be compatible with either Type IIS Assembly or RFC 10 Assembly. The original MIT iGEM 2006 construct was only compatible with RFC 10 Assembly, but after altering the sequence as described above to create our composite part, it became compatible with both Type IIS Assembly and RFC 10 Assembly. Increasing the options for assembly methods compatible with our construct will help make its construction more accessible to researchers.


Build:

In order to assemble the original MIT iGEM 2006 construct along with our improved constructs, we selected to use Gibson Assembly. The inserts were synthesized by IDT and were flanked by unique nucleotide sequences (UNSs) 1 and 10 (Torella et al. 2014). Our backbone, pSB1C3, was also flanked by UNSs 1 and 10. We isolated this backbone from a 2018 William and Mary iGEM construct using PCR primers that anneal to the UNS sequences, then performed a gel extraction and PCR purification before utilizing the backbone for Gibson Assembly. Please download our protocols for PCR, gel extraction, PCR purification, transformation of E. coli and Gibson Assembly here:

Download Protocols Below:



Test:


Protocol for Testing Parts

To test the effectiveness of our improved parts, our team grew all three constructs in E. coli NEB5-α cells in our plate reader. They were grown at 37°C, and were continuously shaking in a plate reader (with OD600 and fluorescence measurements taken every 10 minutes) with the following exceptions: the cultures were not shaking while measurements were taken, and the cultures were placed at around room temperature outside the plate reader for about two hours during the experiment (during which no measurements were taken). Although the cultures were grown for around a total of 19 hours and 35 minutes, the graphs shown below only include measurements taken starting from around the 10 hour and 5 minute mark. In addition, the graphs below represent the averages of fluorescence measurements (normalized to OD600) taken from two biological replicates of our experiment. However, the inoculation process differed between these two replicates. For one of our replicates, we diluted a culture grown overnight in 4 mLs of LB to an OD of 0.1, then loaded the culture into wells to grow overnight. For the other replicate, we inoculated into 1 mL of culture, waited roughly 30 minutes, and loaded the culture into the well plates to grow. For both replicates, 200 uLs of culture was loaded into each well. Also, for both replicates, green fluorescence was measured using an excitation wavelength of 485 nm and an emission wavelength of 528 nm, and red fluorescence was measured using an excitation wavelength of 584 nm and an emission wavelength of 610 nm. Please download our raw fluorescence data and protocols for preparation of LB media, inoculation, and plate reader testing below.

Download Raw Fluorescence Data Below:

Download Protocols Below:

  • Preparation of LB Media Protocol
  • Inoculation Protocol
  • Plate Reader Testing Protocol


    Results


    BBa_K417002 Characterization (osmY-sfGFP Construct)

    To test the effectiveness of our osmY-sfGFP construct (BBa_K4174002), our team transformed the original MIT iGEM 2006 construct, our improved osmY-sfGFP construct, and our improved osmY-mRFP1 construct (BBa_K4174001) in E. coli NEB5-α in a plate reader. The various transformants were grown at 37°C. For green fluorescence, we used an excitation wavelength of 485 nm and an emission wavelength of 528 nm. The values for green fluorescence intensity normalized to OD600 values are reported below.

    Graph Label: Please note that the data represented in the graph above only includes measurements taken starting from around the 10 hour and 5 minute mark (out of a total growth time of about 19 hours and 35 minutes). In addition, the data shown represents the averages of fluorescence measurements (normalized to OD600) taken from two biological replicates of our experiment. Please note that the inoculation process differed between these two replicates (see the protocol section of this page for details)

    As seen in the graph above, both the osmY-sfGFP and MIT iGEM 2006 GFP constructs enter stationary phase right before 14 hours, but our improved osmY-sfGFP construct is much more fluorescent. The other constructs are our osmY-RFP construct and untransformed E. coli cells, both of which serve as negative controls for green fluorescence.

    As seen in the image above, qualitative results reveal that our improved constructs are more fluorescent than the original construct. Here, bacterial cells transformed with our osmY-sfGFP construct are on the far right, and are visibly more green than cells transformed with the original MIT iGEM 2006 construct.


    BBa_K417001 Characterization (osmY-mRFP1 Construct)

    To test the effectiveness of our osmY-mRFP1 construct (BBa_K4174001), our team transformed the original MIT iGEM 2006 construct (BBa_J45995), our osmY-mRFP1 construct, and our osmY-sfGFP construct (BBa_K4174001) into E. coli NEB5-α cells and grew the various transformants in a plate reader. For red fluorescence measurements, we used an excitation wavelength of 584 nm and an emission wavelength of 610 nm. The values for red fluorescence intensity normalized to OD600 values are reported below.

    Graph Label: Please note that the data represented in the graph above only includes measurements taken starting from around the 10 hour and 5 minute mark (out of a total growth time of about 19 hours and 35 minutes). In addition, the data shown represents the averages of fluorescence measurements (normalized to OD600) taken from two biological replicates of our experiment. Please note that the inoculation process differed between these two replicates (see the protocol section of this page for details).

    Based on the graph above, the bacterial cells engineered with our osmY-mRFP1 construct appear to have entered stationary phase around 16 hours. As seen in the graph, our osmY-mRFP1 construct produces more red fluorescence than the original construct (as expected since the original construct was a GFP construct). The other measurements taken are for our osmY-sfGFP construct and untransformed E. coli NEB5-α cells, both of which serve as negative controls for red fluorescence.

    The fluorescence intensity of the untransformed cells appears higher than that of cells transformed with our osmY-mRFP1 construct due to our normalization process. Although the raw fluorescence values of the untransformed cells were consistently lower than the raw fluorescence values of the osmY-mRFP1 circuit, the OD600 values of the untransformed cells were much lower than the OD600 values of the cells transformed with osmY-mRFP1. Therefore, when we normalized by dividing raw fluorescence by OD600, the fluorescence intensity of the untransformed cells appeared to be higher than that of the osmY-mRFP1 transformants.

    As seen in the image above, qualitative results reveal that our osmY-mRFP1 construct produces more red fluorescence than the original construct. Here, bacterial cells transformed with our osmY-mRFP1 construct are on the far left, and are visibly red, unlike the other two cultures transformed with the osmY-sfGFP and MIT iGEM 2006 constructs.


    Learn:

    Through our testing process, we learned that these two combinations of parts lead to an optimized construct.



References



Campbell, R. E., Tour, O., Palmer, A. E., Steinbach, P. A., Baird, G. S., Zacharias, D. A., & Tsien, R. Y. (2002). A monomeric red fluorescent protein. Proceedings of the National Academy of Sciences of the United States of America, 99(12), 7877–7882. https://doi.org/10.1073/pnas.082243699

Ceroni, F., Algar, R., Stan, G., & Ellis, T. (2015). Quantifying cellular capacity identifies gene expression designs with reduced burden. Nature Methods, 12(5):415-418. Doi: 10.1038/nmeth.3339

Chang, D. E., Smalley, D. J., & Conway, T. (2002). Gene expression profiling of Escherichia coli growth transitions: an expanded stringent response model. Molecular microbiology, 45(2), 289-306.

Jaishankar, J., & Srivastava, P. (2017). Molecular basis of stationary phase survival and applications. Frontiers in microbiology, 8, 2000.

Pédelacq, J. D., Cabantous, S., Tran, T., Terwilliger, T. C., & Waldo, G. S. (2006). Engineering and characterization of a superfolder green fluorescent protein. Nature biotechnology, 24(1), 79-88.

Torella, J. P., Boehm, C. R., Lienert, F., Chen, J. H., Way, J. C., & Silver, P. A. (2014). Rapid construction of insulated genetic circuits via synthetic sequence-guided isothermal assembly. Nucleic acids research, 42(1), 681–689. doi.org/10.1093/nar/gkt860