Overview

The goal of this year’s Part Improvement was the adaptation of the spacer sequence between the Ribosome Binding Site and the Start Codon in order to reduce the context dependency of a commonly used RBS, BBa_B0034. Ribosome Binding Sites are strongly influenced by their surrounding sequences and, therefore, such adaptations in these spacer regions might tune their function. iGEM RBS parts are distributed with a specific 3’ spacer scar and, by modifying this spacer a new improved part, BBa_K4294559, was built and characterized regarding its context dependency.

Background



In general, our project revolves widely around the characteristics of a Ribosome Binding Site (RBS) sequence. As it is often stated in the rest wiki pages, Ribosome Binding Sites are highly context dependent. Even though their intrinsic Shine-Dalgarno sequence plays an important role in defining their performance, the neighboring genetic context, namely their neighboring sequences, influence their function. For example, transcribed operator sequences or coding sequences (CDS) near the start codon that are complementary to the RBS might trap the RBS in a stem-loop, inhibiting ribosome binding and, therefore, protein expression [1]. In general, stable mRNA structures at the RBS are limiting the translation initiation rate. The same RBS might function considerably differently across varying genetic contexts. This makes RBS characterizations often unreliable and since the RBS strength at initiating translation must be evaluated in every new designed circuit, RBS parts must be improved in terms of their context dependency to become true versatile, plug-and-play devices.



Many different approaches have been developed to combat this issue. Regarding the Promoter:5’UTR junction context, a few examples include self-cleaving ribozymes, which provide a standard genetic context at the 5’ end and can insulate the RBS from the Promoter’s context [2] and unstructured 5’ tails acting as ribosome “standby sites”, which facilitate translation initiation by pre-bound to the mRNA ribosomes at the short frames of time in which the stable mRNA structure is transiently unfolded [3]. Regarding the 5’UTR: CDS junction, the often-mentioned bicistronic design (BCD) by Mutalik et. [4] al provides an efficient way to combat variability in protein expression.



Prokaryotic RBS parts are often registered as sequences mostly consisting of a Shine Dalgarno sequence (the complementary to the 16S rRNA sequence). However, all of the above indicates that their nearby genetic context also needs to be defined. iGEM RBS parts are distributed with an extra 5’ biobrick and 3’ biobrick scar which influence their function and probability of interaction with nearby sequences.



In a recent research paper by Duan et. al [5], general design criteria to improve programmability and minimize context dependence of RBS were proposed. The translation of multiple genes under the same RBS was quantitatively characterized to evaluate the context dependence of each RBS variant in E. coli. Among other parameters of RBS function that were evaluated, their results showed that an AC-rich spacer region (−7 to −1) between the SD sequence and the start codon was associated with lower context dependence.

Based on the aforementioned research we developed a version of the commonly used BBa_B0034 RBS part with an altered spacer sequence between the RBS and the ATG start codon.


In Silico Analysis



Unsupervised Machine Learning


The goal to reduce the variability of the RBS BBa_B0034 via redesigning the spacer region would mean that our team should locate spacer mutants for which we could have quantitative results regarding their context dependency. That is no easy task and would normally require a large amount of experimental measurements with different spacer variants and genes of interest in order to create a large dataset. Of course, during the iGEM short time window, such a procedure would be impossible in parallel with the rest of the project experiments. As a result, our team decided to use an already tested-in-lab dataset and the different features of its input sequences in order to determine the context dependency of our novel RBS sequences.



The dataset was provided in the research paper by Duan et. al.[5]. In particular, they characterized a library of RBS mutants in the context of GFP and RFP fluorescent proteins. In this way, they created a labeled dataset of RBS sequences in which RBS variants with similar values of GFP and RFP fluorescence(< 20% difference) would be characterized as low context dependent (LCD) and the others would be labeled as high context dependent(HCD).



Simultaneously they also performed a feature analysis in order to determine the exact factors that play a more significant role in the context dependency of their algorithm and put an emphasis on the nucleotide sequence not only of the entire RBS sequence,but also of the RBS - Start Codon spacer region in particular, as well as characteristics of the secondary structure, such as MFE, and the binding affinity between RBS and 16S rRNA.After correlation analysis, they concluded to a 27-long vector of RBS characteristics which showed important statistical correlation regarding protein translation rates (via the Mann-Whitney Test [6]).



In the above workflow, they created a dataset that was unbalanced, meaning that the more context independent sequences were significantly fewer than the ones showing higher context independency (1:8 ratio). That type of dataset is a prime candidate for unsupervised learning models, which are capable of detecting outlier candidates. That is due to the fact that the given training model is already unbalanced, meaning that it is tuned to only detect normal cases. In the situation where the model is presented with a novel anomaly case (an LCD RBS in our case), it will create an anomaly score leading to the detection of it.



In terms of our workflow, we used the already characterized dataset provided by Duan et. al [5] and trained it for many different unsupervised learning algorithms via the pycaret library [7] . The only main difference is the binding affinity metric which we noticed that it couldn’t be replicated by any metric we tested on the Vienna RNA package. So, instead we used the pf_fold() function and in particular the ensemble energy variable it gives as an output which is proportional to the binding affinity metric used in the paper[8]. This was done not only for our own testing sequences but also for the original dataset ones.



Figure 1. Different Outlier Detection algorithms offered via the pycaret package

The anomaly detection module of pycaret ranks trained and tested sequences based on an anomaly metric from 0 to 1, with the anomaly LCD values being closer to an anomaly score of 1. The 4 learning models used for our analysis via the pycaret library are briefly presented below.



1. Principal Component Analysis (PCA)


PCA is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and enabling the visualization of multidimensional data [9]. Formally, PCA is a statistical technique for reducing the dimensionality of a dataset.



Figure 2. Brief explanation of PCA Algorithm [10].

In terms of using it in unsupervised learning, PCA can detect outlier inputs via calculating the euclidean distance between the novel input vector and the reduced space of the dataset provided by PCA. A high distance means a high reconstruction error and thus an anomaly score closer to one.



Figure 3. Example of using PCA analysis for 2 points; one close to the reduced space and one away from it. The reconstruction error is represented by dotted lines between the new and the reconstructed points


2. One Class Support Vector Machine (One-class SVM)


One-class SVM is a special case of SVMs, which are normally supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. In brief, SVM maps training examples to points in space so as to maximize the width of the gap between the various categories. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall. The kernel separating the different labels could be either linear or non-linear, as shown below.


Figure 4. Example of linear hyperplane produced by SVM in order to distinct classes 1 and 2 [11].

As for one-class SVM, it is a variance of the original SVM algorithm which has as a target to find the hyperplane is as close to the original dataset points as possible [12]. As a result, when a novel point is presented that is away from the hyperplane, then one-class SVM detects it as an anomaly



3. K-nearest neighbors (knn)


The knn algorithm is a supervised learning classifier, which uses proximity to make classifications or predictions about the grouping of an individual data point The algorithm computes the euclidean distance of a novel point in order to compute its k-nearest neighbors and then use the neighbors labels in order to determine the new points’ label.


Figure 5. Classification example using knn for a novel point [13].


4. Isolation Forest Algorithm (IForest)


Isolation forest or IForest is an anomaly detection algorithm that uses isolation (how far a data point is to the rest of the data), rather than modeling the normal points Isolation forests are based on decision trees; randomly sub-sampled data is processed in a tree structure based on randomly selected features. he samples that travel deeper into the tree are less likely to be anomalies as they require more cuts to isolate them. Similarly, the samples which end up in shorter branches indicate anomalies as it was easier for the tree to separate them from other observations [14].


Figure 6. Visual Representation of iForest for anomaly detection [14].


After selecting the algorithms described above, then we trained the initial dataset for each one of those algorithms and saved the models that were produced on the .pkl files. Simultaneously we created our own muted spacer sequences on the RBS_BBa0034 variants-100 with a spacer length 5 and 100 more with a spacer length 6- and tested the 4 different trained models in order to rank the novel RBS variants in terms of “anomaly”. Then kept the top 20 variants from each model, i.e. the 20 sequences with the highest anomaly scores and then kept the variables that were present on all 4 “top 20” variant lists.


The resulted sequences from the in silico analysis are presented on table 1


Index Name Sequence
1 RBS_22_spacer_5_vol_5 AGAGGAGAAAGCCCCA
2 RBS_22_spacer_5_vol_96 AGAGGAGAAACCTCCA
3 RBS_22_spacer_6_vol_29 GAGGAGAAACATCCCA
4 RBS_22_spacer_6_vol_47 GAGGAGAAACCCGCAA
5 RBS_22_spacer_6_vol_59 GAGGAGAAACCCTCCA
Table 1. Final RBS BBa_B0034 originated parts with mutant spacers, after the intersection of all 4 ML models. With bold, we highlight the 5 and 6 nucleotides long spacer regions that were randomly generated (via the random_seq_generator.ipynb script).


Due to a lack of time for testing all 5 sequences, the wet lab team initially arbitrarily decided on testing 3 variants. Since all variants have the same AC content and one T or C in the spacer region, the T containing variants were selected on the grounds of the rough estimation that an AT hydrogen bond with a neighboring sequence would offer less stability than a stronger GC bond. Hence, we were led to choose RBS variants with index values 2,3,5 on table 1 for experimental validation of our in silico analysis. In the end, due to time and wet lab engineering constraints, only the RBS variant with index value 5 (V59) was tested in comparison with the original BBa_B0034 RBS.


For a better view of the generated scripts and analysis, please refer to our improve-a-part repository on gitlab


Wet Lab Validation


The following RBS variants were selected from the in silico analysis (Table 2.)


RBS Name 5' Scar BBa_B0034 part sequence RBS - Start Codon Spacer Part Code
V59 tactagag aaagaggagaaa ccctcca BBa_K4294559
V96 tactagag aaagaggagaaa cctcca BBa_K4294596
V29 tactagag aaagaggagaaa catccca BBa_K4294629
Elowitz RBS tactagag aaagaggagaaa tactaa BBa_B0034
Table 2. The selected RBS sequences. The original RBS BBa_B0034 is also added.


To verify the reduction in context dependency of the improved RBS spacer variants, different genetic contexts were introduced by fusions of the fluorescent protein sfGFP with the first 36 nucleotides of the proteins Penicillin acylase (PA), Phosphomevalonate Kinase (PMK) and AraC. The sequences of the PA and PMK 36nt fragments were derived from Duan et. al [5]paper data and the sequence of the sfGFP and AraC CDS from Mutalik et. al [4] paper data. sfGFP retains its function when it is fused with poorly folded peptides[16] and is therefore a reliable reporter for examining the effect of the genetic context on translation rate. The following characterization parts were generated (Table 3.)


Reporter Part Code
36nt PA - sfGFP fusion Protein BBa_K4294203
36nt PMK -sfGFP fusion Protein BBa_K4294204
36nt AraC - sfGFP fusion Protein BBa_K4294205
Table 3. sfGFP fusions with the first 36 nucleotides of other coding sequences.


The above parts together with the CDS of sfGFP (BBa_K429420) create four genetic contexts in which the original BBa_B0034 RBS and the V59 RBS BBa_K4294559 variant were tested. Unfortunately the 36ntPMK - sfGFP fusion did not produce any fluorescence in both groups and therefore the context dependency was evaluated in three genetic contexts.


Overview of the experimental procedure



As it is explained in more detail in the attached pdf file below, only the V59 BBa_K4294559 RBS was finally tested in comparison with BBa_B0034. The V59 RBS part and the 36nt Double Stranded Oligo parts were constructed via primer annealing and extension. BsaI cut sites were introduced to facilitate the assembly of the characterization constructs following the Golden Gate Assembly Protocol for BsaI. The plasmid vector was the pTU1-A-lacZ plasmid that was widely used in our project. The Type2S cloning with BsaI compatible RBS BBa_B0034 was provided by the iGEM 2022 distribution kit plate 1. The transcription of the reporters was put under the control of the Ptet promoter and the final constructs were transformed into DH5a-z1 cells, which have genome encoded TetR production. Induction of bacteria was performed with 0.5 μM aTc concentration for 3h. OD and Fluorescence measurements were taken using the Flexstation 3 (Molecular Devices) plate reader. The Fluorescence (in RFU)/OD600 ratio was calculated for each reporter construct, the average value and the standard sample deviation of the ratios of all constructs for both RBS variants (original and v59) were calculated. Comparison of context dependence was evaluated via the calculation of the Coefficient of Variation of each group.


More information regarding the construct validation and measurement results are provided in this pdf file




Results



The following figures and table summarize the experimental data regarding the Fluorescence/OD measurements of each reporter under translational regulation of the original BBa_B0034 RBS and the V59 RBS BBa_K4294559.


Figure 7. Bar Plot of the Fluorescence/OD (in arbitrary units) ratios of the different reporters under the translational control of the BBa_B0034 RBS after a 3h induction with 0.5 μM of aTc. The 36nt PMK - sfGFP fusion was proven not to function in both RBS groups and was excluded from the CV calculation.

Figure 8. Bar Plot of the Fluorescence / OD (in arbitrary units) ratios of the different reporters under the translational control of the V59 BBa_K4294559 RBS after a 3h induction with 0.5 μM of aTc. The 36nt PMK - sfGFP fusion was proven not to function in both RBS groups and was excluded from the CV calculation.

B0034 v59
Average 4507.915552 Average 1996.770252
Standard Deviation 1532.370964 Standard Deviation 397.9864727
CV 0.3399289419 CV 0.1993151051
Table 4. Average, Standard Sample Deviation and CV values.

With the V59 RBS the Coefficient of Variance regarding the output signal is reduced by ~14%. The total output is also decreased; it is however in the same order of magnitude in comparison with the output of the BBa_B0034 regulated reporters.


Conclusions

The BBa_B0034 RBS was improved regarding its context dependence by the alteration of the RBS - Start Codon spacer region and the creation of the V59 BBa_K4294559. An obvious caveat regarding our results is the limited sample of the tested genetic contexts. However, since the choice of the potential spacer regions was supported by the in silico analysis, we believe that the V59 RBS is indeed an attractive alternative of the BBa_B0034 RBS and we encourage future iGEM teams in further characterizing it in new genetic contexts alongside with the other RBS candidate sequences that were not tested by our team. The potential reported lower translation rate could be alleviated by transcriptional enhancers [17] leading to the reconstruction of existing RBS parts that will achieve similar average translation rates in comparison with their respective original parts, but with a lower context dependence.

1. de Smit MH, van Duin J. Secondary structure of the ribosome binding site determines translational efficiency: a quantitative analysis. Proc Natl Acad Sci U S A. 1990 Oct;87(19):7668-72. doi: 10.1073/pnas.87.19.7668. PMID: 2217199; PMCID: PMC54809. 2. Lou C, Stanton B, Chen YJ, Munsky B, Voigt CA. Ribozyme-based insulator parts buffer synthetic circuits from genetic context. Nat Biotechnol. 2012 Nov;30(11):1137-42. doi: 10.1038/nbt.2401. Epub 2012 Oct 3. PMID: 23034349; PMCID: PMC3914141. 3. Sterk M, Romilly C, Wagner EGH. Unstructured 5'-tails act through ribosome standby to override inhibitory structure at ribosome binding sites. Nucleic Acids Res. 2018 May 4;46(8):4188-4199. doi: 10.1093/nar/gky073. PMID: 29420821; PMCID: PMC5934652. 4.Mutalik VK, Guimaraes JC, Cambray G, Lam C, Christoffersen MJ, Mai QA, Tran AB, Paull M, Keasling JD, Arkin AP, Endy D. Precise and reliable gene expression via standard transcription and translation initiation elements. Nat Methods. 2013 Apr;10(4):354-60. doi: 10.1038/nmeth.2404. Epub 2013 Mar 10. PMID: 23474465. 5. Duan Y, Zhang X, Zhai W, Zhang J, Zhang X, Xu G, Li H, Deng Z, Shi J, Xu Z. Deciphering the Rules of Ribosome Binding Site Differentiation in Context Dependence. ACS Synth Biol. 2022 Aug 19;11(8):2726-2740. doi: 10.1021/acssynbio.2c00139. Epub 2022 Jul 25. PMID: 35877551. 6. Mann–Whitney U test [Accessed 10/10/2022] 7. PyCaret software tool. https://pycaret.org/ [Accessed 10/10/2022] 8.RNAfold software tool. https://www.tbi.univie.ac.at/RNA/RNAfold.1.html [Accessed 12/10/2022] 9. Principal component analysis. 10.https://devopedia.org/principal-component-analysis [Accessed 9/10/22] 11.https://www.sciencedirect.com/topics/computer-science/support-vector-machine [Accessed 9/10/22] 12. Kun-Lun Li, Hou-Kuan Huang, Sheng-Feng Tian, Wei Xu. 2003 Improving one-class SVM for anomaly detection.Proceedings of the 2003 International Conference on Machine Learning and Cybernetics (IEEE Cat. No.03EX693) 13.Bernhard Schölkopf, Robert Williamson, Alex Smola, John Shawe-Taylor, and John Platt. 1999. Support vector method for novelty detection. In Proceedings of the 12th International Conference on Neural Information Processing Systems (NIPS'99). MIT Press, Cambridge, MA, USA, 582–588. 14. https://www.ibm.com/topics/knn [Accessed 10/10/22] 15. Regaya, Y., Faldi, F., Amire, A., Point-Denoise: Unsupervised outlier detection for 3D point clouds enhancement. July 2021 DOI: 10.1007/s11042-021-10924-x 16.Pédelacq JD, Cabantous S, Tran T, Terwilliger TC, Waldo GS. Engineering and characterization of a superfolder green fluorescent protein. Nat Biotechnol. 2006 Jan;24(1):79-88. doi: 10.1038/nbt1172. Epub 2005 Dec 20. Erratum in: Nat Biotechnol. 2006 Sep;24(9):1170. PMID: 16369541. 17.Takahashi S, Furusawa H, Ueda T, Okahata Y. Translation enhancer improves the ribosome liberation from translation initiation. J Am Chem Soc. 2013 Sep 4;135(35):13096-106. doi: 10.1021/ja405967h. Epub 2013 Aug 22. PMID: 23927491.