The goal of this year’s Part Improvement was the adaptation of the spacer sequence between the Ribosome Binding Site and the Start Codon in order to reduce the context dependency of a commonly used RBS, BBa_B0034. Ribosome Binding Sites are strongly influenced by their surrounding sequences and, therefore, such adaptations in these spacer regions might tune their function. iGEM RBS parts are distributed with a specific 3’ spacer scar and, by modifying this spacer a new improved part, BBa_K4294559, was built and characterized regarding its context dependency.
In general, our project revolves widely around the characteristics of a Ribosome Binding Site (RBS) sequence. As it is often stated in the rest wiki pages, Ribosome Binding Sites are highly context dependent. Even though their intrinsic Shine-Dalgarno sequence plays an important role in defining their performance, the neighboring genetic context, namely their neighboring sequences, influence their function. For example, transcribed operator sequences or coding sequences (CDS) near the start codon that are complementary to the RBS might trap the RBS in a stem-loop, inhibiting ribosome binding and, therefore, protein expression [1]. In general, stable mRNA structures at the RBS are limiting the translation initiation rate. The same RBS might function considerably differently across varying genetic contexts. This makes RBS characterizations often unreliable and since the RBS strength at initiating translation must be evaluated in every new designed circuit, RBS parts must be improved in terms of their context dependency to become true versatile, plug-and-play devices.
Many different approaches have been developed to combat this issue. Regarding the Promoter:5’UTR junction context, a few examples include self-cleaving ribozymes, which provide a standard genetic context at the 5’ end and can insulate the RBS from the Promoter’s context [2] and unstructured 5’ tails acting as ribosome “standby sites”, which facilitate translation initiation by pre-bound to the mRNA ribosomes at the short frames of time in which the stable mRNA structure is transiently unfolded [3]. Regarding the 5’UTR: CDS junction, the often-mentioned bicistronic design (BCD) by Mutalik et. [4] al provides an efficient way to combat variability in protein expression.
Prokaryotic RBS parts are often registered as sequences mostly consisting of a Shine Dalgarno sequence (the complementary to the 16S rRNA sequence). However, all of the above indicates that their nearby genetic context also needs to be defined. iGEM RBS parts are distributed with an extra 5’ biobrick and 3’ biobrick scar which influence their function and probability of interaction with nearby sequences.
In a recent research paper by Duan et. al [5], general design criteria to improve programmability and minimize context dependence of RBS were proposed.
The translation of multiple genes under the same RBS was quantitatively characterized to evaluate the context dependence of each RBS variant in
Based on the aforementioned research we developed a version of the commonly used
BBa_B0034 RBS part with an altered spacer sequence between the RBS and the ATG start codon.
The goal to reduce the variability of the RBS BBa_B0034 via redesigning the spacer region would mean that our team should locate spacer mutants for which we could have quantitative results regarding their context dependency. That is no easy task and would normally require a large amount of experimental measurements with different spacer variants and genes of interest in order to create a large dataset. Of course, during the iGEM short time window, such a procedure would be impossible in parallel with the rest of the project experiments. As a result, our team decided to use an already tested-in-lab dataset and the different features of its input sequences in order to determine the context dependency of our novel RBS sequences.
The dataset was provided in the research paper by Duan et. al.[5]. In particular, they characterized a library of RBS mutants in the context of GFP and RFP fluorescent proteins. In this way, they created a labeled dataset of RBS sequences in which RBS variants with similar values of GFP and RFP fluorescence(< 20% difference) would be characterized as low context dependent (LCD) and the others would be labeled as high context dependent(HCD).
Simultaneously they also performed a feature analysis in order to determine the exact factors that play a more significant role in the context dependency of their algorithm and put an emphasis on the nucleotide sequence not only of the entire RBS sequence,but also of the RBS - Start Codon spacer region in particular, as well as characteristics of the secondary structure, such as MFE, and the binding affinity between RBS and 16S rRNA.After correlation analysis, they concluded to a 27-long vector of RBS characteristics which showed important statistical correlation regarding protein translation rates (via the Mann-Whitney Test [6]).
In the above workflow, they created a dataset that was unbalanced, meaning that the more context independent sequences were significantly fewer than the ones showing higher context independency (1:8 ratio). That type of dataset is a prime candidate for unsupervised learning models, which are capable of detecting outlier candidates. That is due to the fact that the given training model is already unbalanced, meaning that it is tuned to only detect normal cases. In the situation where the model is presented with a novel anomaly case (an LCD RBS in our case), it will create an anomaly score leading to the detection of it.
In terms of our workflow, we used the already characterized dataset provided by Duan et. al [5] and trained it for many different unsupervised learning algorithms via the pycaret library [7] . The only main difference is the binding affinity metric which we noticed that it couldn’t be replicated by any metric we tested on the Vienna RNA package. So, instead we used the pf_fold() function and in particular the ensemble energy variable it gives as an output which is proportional to the binding affinity metric used in the paper[8]. This was done not only for our own testing sequences but also for the original dataset ones.
The anomaly detection module of pycaret ranks trained and tested sequences based on an anomaly metric from 0 to 1, with the anomaly LCD values being closer to an anomaly score of 1. The 4 learning models used for our analysis via the pycaret library are briefly presented below.
PCA is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and enabling the visualization of multidimensional data [9]. Formally, PCA is a statistical technique for reducing the dimensionality of a dataset.
In terms of using it in unsupervised learning, PCA can detect outlier inputs via calculating the euclidean distance between the novel input vector and the reduced space of the dataset provided by PCA. A high distance means a high reconstruction error and thus an anomaly score closer to one.
One-class SVM is a special case of SVMs, which are normally supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. In brief, SVM maps training examples to points in space so as to maximize the width of the gap between the various categories. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall. The kernel separating the different labels could be either linear or non-linear, as shown below.
As for one-class SVM, it is a variance of the original SVM algorithm which has as a target to find the hyperplane is as close to the original dataset points as possible [12]. As a result, when a novel point is presented that is away from the hyperplane, then one-class SVM detects it as an anomaly
The knn algorithm is a supervised learning classifier, which uses proximity to make classifications or predictions about the grouping of an individual data point The algorithm computes the euclidean distance of a novel point in order to compute its k-nearest neighbors and then use the neighbors labels in order to determine the new points’ label.
Isolation forest or IForest is an anomaly detection algorithm that uses isolation (how far a data point is to the rest of the data), rather than modeling the normal points Isolation forests are based on decision trees; randomly sub-sampled data is processed in a tree structure based on randomly selected features. he samples that travel deeper into the tree are less likely to be anomalies as they require more cuts to isolate them. Similarly, the samples which end up in shorter branches indicate anomalies as it was easier for the tree to separate them from other observations [14].
After selecting the algorithms described above, then we trained the initial dataset for each one of those algorithms and saved the models that were produced on the .pkl files. Simultaneously we created our own muted spacer sequences on the RBS_BBa0034 variants-100 with a spacer length 5 and 100 more with a spacer length 6- and tested the 4 different trained models in order to rank the novel RBS variants in terms of “anomaly”. Then kept the top 20 variants from each model, i.e. the 20 sequences with the highest anomaly scores and then kept the variables that were present on all 4 “top 20” variant lists.
The resulted sequences from the in silico analysis are presented on table 1
Index | Name | Sequence |
---|---|---|
1 | RBS_22_spacer_5_vol_5 | AGAGGAGAAAGCCCCA |
2 | RBS_22_spacer_5_vol_96 | AGAGGAGAAACCTCCA |
3 | RBS_22_spacer_6_vol_29 | GAGGAGAAACATCCCA |
4 | RBS_22_spacer_6_vol_47 | GAGGAGAAACCCGCAA |
5 | RBS_22_spacer_6_vol_59 | GAGGAGAAACCCTCCA |
Due to a lack of time for testing all 5 sequences, the wet lab team initially arbitrarily decided on testing 3 variants. Since all variants have the same AC content and one T or C in the spacer region, the T containing variants were selected on the grounds of the rough estimation that an AT hydrogen bond with a neighboring sequence would offer less stability than a stronger GC bond. Hence, we were led to choose RBS variants with index values 2,3,5 on table 1 for experimental validation of our in silico analysis. In the end, due to time and wet lab engineering constraints, only the RBS variant with index value 5 (V59) was tested in comparison with the original BBa_B0034 RBS.
For a better view of the generated scripts and analysis, please refer to our improve-a-part repository on gitlab
The following RBS variants were selected from the in silico analysis (Table 2.)
RBS Name | 5' Scar | BBa_B0034 part sequence | RBS - Start Codon Spacer | Part Code |
---|---|---|---|---|
V59 | tactagag | aaagaggagaaa | ccctcca | BBa_K4294559 |
V96 | tactagag | aaagaggagaaa | cctcca | BBa_K4294596 |
V29 | tactagag | aaagaggagaaa | catccca | BBa_K4294629 |
Elowitz RBS | tactagag | aaagaggagaaa | tactaa | BBa_B0034 |
To verify the reduction in context dependency of the improved RBS spacer variants, different genetic contexts were introduced by fusions of the fluorescent protein sfGFP with the first 36 nucleotides of the proteins Penicillin acylase (PA), Phosphomevalonate Kinase (PMK) and AraC. The sequences of the PA and PMK 36nt fragments were derived from Duan et. al [5]paper data and the sequence of the sfGFP and AraC CDS from Mutalik et. al [4] paper data. sfGFP retains its function when it is fused with poorly folded peptides[16] and is therefore a reliable reporter for examining the effect of the genetic context on translation rate. The following characterization parts were generated (Table 3.)
Reporter | Part Code |
---|---|
36nt PA - sfGFP fusion Protein | BBa_K4294203 |
36nt PMK -sfGFP fusion Protein | BBa_K4294204 |
36nt AraC - sfGFP fusion Protein | BBa_K4294205 |
The above parts together with the CDS of sfGFP (BBa_K429420) create four genetic contexts in which the original BBa_B0034 RBS and the V59 RBS BBa_K4294559 variant were tested. Unfortunately the 36ntPMK - sfGFP fusion did not produce any fluorescence in both groups and therefore the context dependency was evaluated in three genetic contexts.
As it is explained in more detail in the attached pdf file below, only the V59 BBa_K4294559 RBS was finally tested in comparison with BBa_B0034. The V59 RBS part and the 36nt Double Stranded Oligo parts were constructed via primer annealing and extension. BsaI cut sites were introduced to facilitate the assembly of the characterization constructs following the Golden Gate Assembly Protocol for BsaI. The plasmid vector was the pTU1-A-lacZ plasmid that was widely used in our project. The Type2S cloning with BsaI compatible RBS BBa_B0034 was provided by the iGEM 2022 distribution kit plate 1. The transcription of the reporters was put under the control of the Ptet promoter and the final constructs were transformed into DH5a-z1 cells, which have genome encoded TetR production. Induction of bacteria was performed with 0.5 μM aTc concentration for 3h. OD and Fluorescence measurements were taken using the Flexstation 3 (Molecular Devices) plate reader. The Fluorescence (in RFU)/OD600 ratio was calculated for each reporter construct, the average value and the standard sample deviation of the ratios of all constructs for both RBS variants (original and v59) were calculated. Comparison of context dependence was evaluated via the calculation of the Coefficient of Variation of each group.
More information regarding the construct validation and measurement results are provided in this pdf file
The following figures and table summarize the experimental data regarding the Fluorescence/OD measurements of each reporter under translational regulation of the original BBa_B0034 RBS and the V59 RBS BBa_K4294559.
B0034 | v59 | ||
---|---|---|---|
Average | 4507.915552 | Average | 1996.770252 |
Standard Deviation | 1532.370964 | Standard Deviation | 397.9864727 |
CV | 0.3399289419 | CV | 0.1993151051 |
With the V59 RBS the Coefficient of Variance regarding the output signal is reduced by ~14%. The total output is also decreased; it is however in the same order of magnitude in comparison with the output of the BBa_B0034 regulated reporters.
The BBa_B0034 RBS was improved regarding its context dependence by the alteration of the RBS - Start Codon spacer region and the creation of the V59 BBa_K4294559. An obvious caveat regarding our results is the limited sample of the tested genetic contexts. However, since the choice of the potential spacer regions was supported by the in silico analysis, we believe that the V59 RBS is indeed an attractive alternative of the BBa_B0034 RBS and we encourage future iGEM teams in further characterizing it in new genetic contexts alongside with the other RBS candidate sequences that were not tested by our team. The potential reported lower translation rate could be alleviated by transcriptional enhancers [17] leading to the reconstruction of existing RBS parts that will achieve similar average translation rates in comparison with their respective original parts, but with a lower context dependence.