Model




Introduction

The vision of our team is to create a generic pipeline architecture that generates Toehold sequences for various cases. We wished to aggregate and develop sophisticated algorithms from different fields to solve this complex problem. The final architecture is complex and hard to follow. Thus we break it down into challenges and how to solve each one.

First challenge: finding the mRNA that is most specific to the target cell (for a detailed explanation, visit our description page). Using statistical methods to solve this challenge is the best approach, thanks to different big data sources.

Second challenge: finding the best trigger sequence. In order to solve this challenge, we developed an algorithm that searches for the best window, within the chosen mRNA sequence, for serving as our trigger sequence. Our algorithm is looking for different metrics that characterize good triggers like local folding energy, open structure, etc`

Third challenge: designing the best Toehold sequence for a given trigger. In order to solve this challenge, we combined different algorithms like NUPACK[4], RNAPKplex[5], IntaRNA[5]. Those algorithms are widely used in the scientific community and have an excellent reputation. On top of this, we also developed our new algorithms to solve sub-challenges, which were not addressed by those algorithms - Finding epitopes, using a different structure for the Toehold.

Fourth challenge: optimizing the translation efficiency model by using several methods to choose the right codons, such as Codon adaptation index(CAI), tRNA adaptation index (tAI), and Typical coding rate (TDR)

On top of those challenges, we generated a computational model (a linear regressor) based on experimental data for predicting the toehold performance in the presence of the trigger molecule and without it. We used the regressor to reveal the influence of different features on the toehold performance and to highlight insights about its design. We also utilized a data-based metric for toehold switch quality assessment. This metric, which was taken from previous work [2], is based on calculations of structural defects that will be elaborated later.







Novel Linear Regression based Metric for Toehold efficiency in Prokaryotes

We developed a novel linear regressor, based on experimental data in prokaryotes, for ranking toehold switch performance. The experimental data we used contains 168 toehold switch sequences that were designed by Green’s lab and tested in E. coli.[1]. The small amount of the data limited us in the accuracy of the predictions and therefore, we decided to use the data to reveal new insight on the influence of different features of the Toehold switch sequence on its performance. The features we created included different energy calculations and structural parameters of the toehold. Finding the correlated features helped us to better understand the design of the switch and improved our models. In addition, using the regressor we were able to rank a list of toehold switch sequences by their performance. The ranking contributed to assess the different toehold designs and therefore to improve our computational toehold switch design even before testing it in the wet lab.

Linear Regression

Linear regression is a statistics linear approach that attempts to model the relationship between a set of variables by fitting a linear equation to observed data[3].

The linear regression equation:

X in the equation is considered to be a vector of explanatory variables, and Y in the equation is considered to be a dependent variable. The relationship between the variables described by m in the equation (a vector or coefficient). The additional B parameter (a constant) captures all other factors which influence the dependent variable Y other than the variable X.

Linear regression analysis is used to predict the value of the dependent variable. In our model the prediction is for the protein abundance levels in the targeted cells and the levels in the non-targeted cells while the explanatory variables are the set of features we calculate based on the toehold switch sequence.

In order to find the relationship between the variables we trained the model with the 168 toehold switch sequences and their experimental results we received from Green’s lab. After training the model, we obtained the m and B vectors of the equation.

Our regressor predicts 3 parameters:

  • ON levels - the protein abundance in the cells expressing both the switch and its cognate trigger (‘targeted cells’).
  • OFF levels - the protein abundance in the cells expressing the switch and a noncognate trigger (‘non-targeted cells’).
  • ON/OFF ratio.

Most studies focused on the ON/OFF ratio as the objective function. Important step towards achieving better toehold performance was to redefine the objective function. When considering only the ratio, bad results could appear as good results. High ON/OFF ratio can resemble good results but in practice to be composed of low ON levels divided by lower OFF levels, which indeed end without effect on cells without the trigger but also without effect on the cells with the trigger. To consider results as satisfied we want to achieve high ON levels, high enough to obtain impact on the targeted cells, and low OFF levels to prevent impact on the non-targeted cells. Thus, we understood that the ratio by itself is not giving us enough information, and therefore we should also examine the absolute expression levels of ON and OFF separately. In this way we can achieve more insights regarding our results.

The results published in Green’s article contain only the ON/OFF ratio. Thus regarding our new conclusion, we asked them to send us the detailed results of the experiments, including ON and OFF levels separately. They kindly send us the full results of 168 toehold switches and those served us for developing our novel predictor.

Toehold switch sequence features

The 168 toehold switch sequences designed by Green et al shared similar structure, contains the following parameters:

  • The gene of interest is GFP
  • Stem length is 18 bp
  • Hairpin length is 11 bp
  • The RBS sequence is AGAGGAGA
  • Length between RBS to AUG is 6 bp

Figure 1 Illustration of the toehold switch structure

At the beginning of the toehold it has a trigger binding site, a complementary part to the trigger molecule sequence; after it, a stem is created. The first part of the stem is a continuation of the complementary part to the trigger. The stem is created thanks to a complementary part in the second half of the toehold. At the middle of the stem there is a small bulb created from the presence of ‘AUG’ on the other side. At the top of the stem, a hairpin is created and contains the RBS. Following the stem there is the gene of interest, the gene we aim to express in the targeted cells.

We created 27 features for the model, including energy calculations and GC content of different toehold structure components, structural parameters, interaction energy calculations and more. Attached here is the full list of features:

  • Minimum free energy (MFE) calculated with nupack package [4]
    • Switch MFE – The MFE of the switch sequence
    • Trigger MFE – The MFE of the trigger sequence
    • Switch switch MFE – The MFE of the interaction between two switche’s sequences
    • Trigger trigger MFE - The MFE of the interaction between two trigger’s sequences
    • Switch-trigger MFE- The MFE of the interaction between switch sequence and trigger sequence
    • Switch with GFP sequence MFE- The MFE of the switch sequence including GFP sequence
  • MFE calculated with Vienna package [5] –
    • Linker MFE - The MFE of the sequence of the linker part of the switch
    • Stem MFE - The MFE of the sequence of the stem part of the switch
    • RBS until end MFE - The MFE of the sequence from the RBS until the end of the switch sequence
  • Interaction energy calculated with intaRNA package [6]
    • Switch-trigger interaction energy
    • Switch-switch interaction energy
    • Trigger-trigger interaction energy
  • GC content calculations
    • GC content of the switch
    • GC content of the trigger
    • GC content of the stem part of the switch
    • GC content of the linker part of the switch
    • GC content of the sequence from RBS until the end of the switch
  • Structure parameters
    • Pre stem length - The length of the sequence before the stem of the switch
    • Complementary trigger (free part/stem part) - The ratio between the length of the sequence before the stem and the length of the trigger complementary part in the stem
    • Switch structure “.“ count - The amount of nucleotides without interactions to others in the switch
    • Trigger structure “.“ count - The amount of nucleotides without interactions to others in the trigger
  • The Nucleotide before hairpin (A-U or G-C)
  • Defect calculation - defect of the free part, defect of the sequence from the start until RBS and the defect of the sequence from RBS until the end
  • Kissing hairpins - binary feature indicates if there is a probability of interactions between the hairpins of two toehold switch molecules

Some of the features have been taken from Green et al work including the MFE calculation of the different structure components (e.g linker, stem etc.) and the identity of the nucleotide before the hairpin.

The defect calculations were inspired by another Green’s lab work by Ma et al, where they presented a method to rank toehold switches based on their previous experimental data.[2] The metric includes a combination of defect calculation, which is a probabilistic distance between the desired structure and the structure obtained in practice. The metric is described in detail in a different section on this page (Assessment Metric). We used the defect calculation as features for the model.

In the article of Green et al, they do not infer a complex model based on a combination of features, instead they report the best feature, which was the MFE calculation from RBS to end. Thus, we used this feature as a reference to our results. Any increase above it, considered for us as an improvement.

Features correlation to the labels

In order to find the best set of features for the model without overfitting, we performed features selection using adjusted correlation calculation as the optimization criteria.

First, we chose the best feature which had the highest correlation to the label.

Attached here a pearson correlation heatmap showing the correlation between the features to each of the labels. It can be seen from the heatmap that there are two features with the highest correlation to label ON. One positive is the stem MFE and one negative which is the switch-switch interaction. For label OFF, the most correlated feature is the stem MFE. However for the label ON/OFF, the correlations are lower but the highest one is the MFE of the Switch with the GFP sequence.


Figure 2 Correlation heatmap of the features and the labels

The heatmap described the Pearson correlation between the 27 features of the model (y-axis) to the model’s labels (x-axis). The sign of the correlation indicates the relationship between the feature and the label. Positive correlation means that with the increase of the feature’s value, the label’s value is increased; negative correlation means that with the increase of the feature, the label value is reduced.

The heatmap results agree with the literature and our assumptions. First, as can be seen, the ON/OFF label seems to be less predictive than the ON and OFF separately. In addition, the most correlated features emphasize that the stem part of the toehold has the most influence on the toehold performance and should be a significant factor in designing the toehold.

Features selection

Next, in each iteration we added one feature from the list and calculated the adjusted correlation of the chosen features. The adjusted correlation equation is [10]:

where p is the total number of explanatory variable in the model, and n is the sample size

Then, we chose the feature that combined with the best feature, yields the highest adjusted correlation. We repeated this part, adding more features to the model until we received no additional improvement in the adjusted correlation. The result features that obtained for each objective appear in Table 1.


Table 1 - Best features selected for each label (along with their corresponding correlation coefficients).



Table 2 - The adjusted correlation of our best set features vs Green’s best feature

From the results it can be seen that we succeed in improving the adjusted correlation using the set of features selected by our model. It means that we improved the ability to predict the ON levels, OFF levels and ON/OFF ratio.

The regression coefficients of the three models appear in the lists below (as well as in Table 1):

ON label coeffients = [-36796.8123923 , 17799.98343884, -18296.37299861, -21000.42057797, 28222.39719668, 39041.79498465, 27075.60791862, -29849.9025678 , -4976.18341367, -4483.06546978, 5311.63763123, 19619.34307479, -20438.73825092, 4253.20887393]

OFF label coeffients= [-4513.77098889, 1894.95663112, 1338.77362406, -1892.02861732, 1909.80579024, -581.19563308, 712.72225314, -1846.70264986, 1954.02997273, -709.01412254, 964.28700903 ]

ON_OFF coeffients = [ 82.07104428, -45.52451068, 55.90282051, -19.72310897, -34.506619 , -31.93795545, -18.91773184, 16.54207188]








Trigger Selection


Our trigger selection algorithm is composed of 2 stages: mRNA selection and window selection within the mRNA molecule.

Definitions

Trigger is a consecutive section of nucleotides of the mRNA which are predicted to hybridize with our toehold switch. We will refer to it as the trigger window in the mRNA. The trigger has a certain length which is a parameter set by the user. For a single mRNA molecule, different triggers with a certain length vary by their start offset of the window section which is a partial sequence of the mRNA.

mRNA Selection

    In choosing an mRNA and a trigger there are few major aspects.
  • mRNA should be highly expressed in the target cells while having as low expression levels as possible in other cells leading to choosing high fold change in expression between the target cells vs others.
  • The basal level of the mRNA in the target cells should be high so that there will be higher concentration and higher probability of interaction with our toehold switch.
  • The mRNA should have a segment which is not folded with high probability so that it would hybridize with the toehold part of our designed switch to undergo hybridization between the mRNA trigger section and the designed toehold switch.

The main data source for choosing mRNA are databases of RNA sequencing experiments, whether single cell or not.

The main data source for choosing mRNA are databases of RNA sequencing (RNA-seq) experiments, whether single cell or not. Based on the RNA-seq data, we define target cells (e.g. certain types of cancer, certain tissue, or any other differential attribute) which were monitored in the RNA-seq experiments. Such expression level information can be used to find a relevant trigger. To find mRNA differentially expressed in one type of cells vs. other type we can use statistical approach for multiple adjusted p values calculation.

Finding best triggers for mRNA

Since our demand is to find a window in an mRNA that will be a good trigger, we should consider both the expression levels of the mRNA and the local folding of that window in the mRNA.

    Thus, the algorithm has 2 stages:
  • The first stage returns mRNA candidates for the next stage.
  • The second stage returns trigger section candidates, which are windows within the mRNA molecule, for each mRNA candidate from the previous stage.

In each stage scores are generated, score for mRNA and score for a certain window in the mRNA which will serve as a trigger. Finally, the scores of the 2 stages are combined and the best candidates are chosen according to the total score.

Score for mRNA

Based on the DEG (differentially expressed genes) data we first filter out those genes with p-value above a specified threshold.

    We then give mRNA score based on:
  • Fold change of the expression levels in the target cells vs. other cells (we want it to be as high as possible).
  • Basal expression level in target cells (we want it to be as high as possible).

Both scores are combined by multiplication.

An example of mRNA scoring for BLCA cancer type is presented at table 3 (differential expression genes data taken from oncodb.org[34])


Table 3 - mRNA scoring method of BLCA cancer type


Score for Trigger

We use algorithms for prediction for each mRNA the mRNA secondary structure, folding free energy and nucleotides base-pairing probability matrix based on the ensemble of secondary structures. Those algorithms are probabilistic, and the accuracy gets decreased as the predicted sequence length gets longer. Based on biophysical models of the ribosome movement during translation[35], it is suggested that the ribosome unfold the mRNA during the translation thus local mRNA (insub-regions of the mRNA) folding is the more relevant measure of the accessibility of a sub window in the mRNA (as illustrated in figure 3).


Figure 3 - Ribosome unfolds mRNA during translation


Thus, we are computing averaged hybridization predictions metrics based on aggregation of two levels of moving windows on the mRNA sequence. We use an averaging window approach for gaining robustness by limiting the influence of single prediction related to one subsequence. The approach is explained below:

    We define two windows for the algorithm:
  • The primary moving window is a subsequence of the mRNA. The length of that window is a constant parameter (120 nucleotides).
  • The secondary window is the trigger window which is a subsequence of the primary window.

The length of that window is another parameter that may be chosen by the user and represents the length of the hybridization site size between the triggering mRNA and our designed toehold switch.

    The main algorithm engine is a loop in a loop calculation:
  • The outer loop scans the mRNA with a moving primary window (i.e. in each step we increment its offset by one nucleotide).
  • The inner loop scans the primary window, which is longer than the trigger secondary window, with a moving window related to the secondary window (i.e. at each step we increment the offset of the secondary window in the primary window).

Thus, each trigger (the secondary window which is the inner window) appears a few times in a few different outer windows. We finally calculate for each trigger an average of the calculated metrics defined below aggregated from each of its occurrences in the different outer windows.


Figure 4 - An illustration of the two sliding windows in our algorithm for finding an efficient trigger.


    For each outer window subsequence, we calculate:
  • Base pairing probabilities
  • Folding free energy
  • MFE predicted secondary structure
    Using the metrics calculated for each outer window, we calculate for each inner window:
  • Inner window average base pairing probabilities (considering only the inner window nucleotides probabilities from the outer window)
  • Inner window average base paired secondary structure ratio from the outer window.
  • Partial folding free energy, which is part of the outer window folding free energy calculated by multiplying the outer window folding free energy by the inner window folded structure ratio.
    Finally, we allow choosing between 2 metrics:
  • Averaged partial folding free energies of trigger section - average of the partial folding free energies nucleotides multiplied by base pairing probabilities of that trigger window subsequence over its appearance in the moving outer window.
  • Averaged folding probability of trigger section - average of the base pairing probabilities of the nucleotides of that trigger window subsequence over its appearance in the moving outer window.

In the following figures (figure 4 and Figure 6) we give an example of the trigger score for a certain gene(TK1) with certain trigger size windows(51 nucleotides) calculated by the 2 metrics defined above.

Example of averaged partial folding free energies metric (gene TK1 1269 nucleotides long) is described in figure 4.


Figure 4 - An example of averaged partial folding free energies metric (gene TK1 1269 nucleotides long)

The x axis is the number of nucleotides offset from the beginning of the gene ORF (open reading frame). The trigger window starts at that offset with a length of trigger size. The y axis is the calculated score. Higher scores are expected to be associated with positions that have unfolded secondary structure and thus can be used as triggers. We can see the 6 highest predicted trigger scores for TK1 gene with subsequence trigger window size of 51 nucleotides are in offsets (peaks marked with red circles on figure) ~160, ~180, ~1170, ~1100, ~950, ~820 (the algorithm avoids returning very close offsets to already chosen ones and will avoid returning candidates offsets close by less than about 10 nucleotides) Those marked offsets will be chosen as best candidates trigger offsets for that gene by the algorithm.


An example of averaged folding probability metric (gene TK1 1269 nucleotides long) is presented in Figure 6.


Figure 6 - An example of averaged folding probability metric (gene TK1 1269 nucleotides long)

The x axis is the number of nucleotides offset from the beginning of the gene ORF (open reading frame). The trigger window starts at that offset with a length of trigger size. The y axis is the calculated score. Higher scores are expected to be associated with positions that have unfolded secondary structure and thus can be used as triggers. We can see the 5 highest predicted trigger scores for TK1 gene with subsequence trigger window size of 51 nucleotides are in offsets (peaks marked with red circles on figure) ~975, ~1180, ~820, ~430, ~280 (the algorithm avoids returning very close offsets to already chosen ones and will avoid returning candidates offsets close by less than about 10 nucleotides) Those marked offsets will be chosen as best candidates trigger offsets for that gene by the algorithm.


Total score for mRNA and trigger window within it

By multiplying the mRNA score with the trigger score, we generate a ranking of n best candidates for toehold switch generation.

Continuing the use case example in table 3 above (i.e. choosing best mRNA for BLCA cancer type) we generate n best windows based on the combined score that considers both the mRNA expression levels and the folding of the window within it. The list of n best candidates in this case is presented in table 4 for n=10.


Table 4 - Ten Best trigger window score in mRNA molecules expressed in BLCA cancer type.


Trigger Based on Cancer-Related Mutations

To broaden our trigger generation space, we consulted with Dr. Yoram Zarai from OncoDechipher and recieved a large data set comprised of over 8300 cancer patients suffering from various types of cancer. We analyzed this data set, which contains over 750 genes, with the intention to find mutations (or group of mutations) that are meaniningful enough for the creation of a trigger that will responde to mutated mRNA molecules but not to the healthy ones. Read more about this subject in our computational proof of concept


Figure 7 - An overview of our cancer-related mutations database, which contains more than 8300 patients and 750 mutated genes.





Sequence Design


NUPACK as a Starting Point

    Our goal is to generate sequences based on two main factors:
  1. Structure: Toehold sequences are functional only thanks to their unique structure. This structure is known as the "Hairpin structure". The stem and hairpin inhibit the translation unit from docking in, thus preventing translation.
  2. Interaction: The Toehold structure is unique thanks to its ability to get opened. It contains a trigger binding site that enables it to be opened in the presence of a specific sequence.

In order to create a sequence that utilizes both attributes, a sophisticated algorithm is needed. Starting a new algorithm from scratch will be irrelevant, as the scientific community has been dealing with this problem for years. After a literature scan, we came across NUPACK - design algorithm.

NUPACK enables the users to declare a target structure and different kinds of constraints. The algorithm considers all the parameters and generates a sequence as close to the target structure as possible. In addition to structure optimization, NUPACK can consider the interaction between several sequences, thus making it perfect for problems like generating Toehold sequences. As strong as NUPACK is, it cannot consider the much bigger picture. Trigger selection, initiation and translation factors and complex structure like pseudoknots are all examples of issues that are not addressed by NUPACK and are critical to the toehold performance. That's where we come into place, trying to close the gap between the NUPACK narrow view on the much more complex problem.

Our Usage with NUPACK API

NUPACK algorithm, as mentioned before, is widely used. Therefore, its input parameters are not designed only for the Toehold sequencing problem. It supports a much wider variety of issues, thus leading to a much richer and more generic interface. In order to adjust the algorithm to suit our goals and needs, we used the input narrowly and more precisely.

One significant example is the use of domains:

NUPACK Domains enable the user to separate an area from any sequence and classify it by name and purpose. Naturally, we divided our structure into the domains presented in Figure 8.


Figure 8 - separation to domains of Toehold structure

The purpose of this division is to classify each part of the sequence with its biological meaning.

    Separating the structure to those domains enabled us to apply constraints in a "smart" manner, thus dealing with some obstacles that could not be overcome with the basic usage of NUPACK:
  1. Preventing Stop and additional Start codons is relevant only for domains that come after the ribosome binding site, as their presence in those domains might interrupt the translation. After we divided the sequences into such domains, applying those constraints was trivial. Though it sounds basic, we have come across other works that did not address this issue(see our enginnering page), and we wanted to ensure that we will not overlook it.
  2. Preventing repetitions of any nucleotide of at least five occurrences. We are generating sequences that have to go through industrial production. Repetitions are not allowed in this process as they increase the chances of synthesis mistakes. In contrast to the previous constraints, this one should be applied to the whole sequence.
  3. Enforcing the gene to be aligned in the correct reading frame, which may also seem like a trivial constraint. However, it is easy to overlook and miscalculate the distance between the first start codon and the beginning of the gene. The above should be applied only in the ORF domain.

We also utilize a dynamic separation into domains in a more sophisticated manner while using our pseudoknots avoidance algorithm, as detailed in the next section.

Organism-Specific Adjustments

Our model takes into account the organism of interest. It uses different structures for prokaryotes (Pardee's "Series B structue[30]) and eukaryotes (a structure selected by us after some validation experiments). In addition, organism-compatible ribosome binding site is selected - Shine-Dalgarno sequence [31] for prokaryotes and Kozak sequences for Eukaryores (while considering the differences in kozak sequences between yeast and Homo sapiens [32,33]).







Pseudoknots Avoidance


Pseudoknots are formed upon base pairing of a single-stranded region of RNA in the loop of a hairpin to a stretch of complementary nucleotides elsewhere in the RNA chain [7]. Pseudoknot types are presented in Figure 9. Their occurrence can harm switch efficiency by creating a locked structure and preventing translation even in targeted cells or creating a different structure than planned.


Figure 9 - Pseudoknot types [8]


NUPACK documentation address this issue:

"In NUPACK 4, pseudoknots are excluded from the structural ensemble."

In our algorithm, we are using the RNAPKplex package by Vienna RNA [5]. This package is at present the only component of the Vienna RNA package that predicts pseudoknotted RNA structure. RNAPKplex is a variant of a basic component of this package, RNAup, which computes RNA-RNA interactions by first computing the local opening energies of two RNA molecules and then computing interaction energies between those molecules. Unlike RNAup, RNAPKplex computes the interactions between positions within the same RNA molecule.

In our pipeline, right after generating the sequence, pseudoknot's presence is checked. If a pseudoknot is found, we use dynamic separation for domains to redesign the sequence in the correct position. At first, the sequence is referred to as a straight line and when a pseudoknot is found, the segment containing it gets separated from the line. This segment is defined as a new separate domain, a thing that allows us to apply constraints on it exclusively. It is important to note that this separation does not harm the whole structure because NUPACK calculates the structure using the combination of all the domains together, and not separately.

The nucleotides that form the pseudoknotted structure are defined as a new constraint that will be applied on the new defined domain, and a new sequence is designed.







Avoidance of Immune Response


Epitope Prevention

When aiming for treatment in humans, an immune response should always be taken into account. For that reason, we implemented an epitope prevention algorithm. Immune response can be caused by epitopes. An epitope is a peptide, typically consisting of around 6 amino-acids, that is recognized by receptors found on B-cells as a foreign (non-self) protein. The receptors bind to the epitope by its structure and recruit the immune system in order to treat the menace.

As part of our design, some amino acids are added to the protein of interest in its 5` side, due to the fact that the translation begins before the start of the sequence of the protein (as shown in Figure 10).


Figure 10 - Visualization of added amino acids

In order to deal with this problem, we implemented an epitope prevention algorithm. In this algorithm, we first check which amino acids are translated from the nucleotides adjacent and downstream of the AUG (see Figure 10). After that, we check whether this amino-acid sequence contains a known epitope, by looking in a vast database of known epitopes taken from the literature [9].

We used the online tool IEDB by NIAID [9], and restricted our search to epitopes that are known to induce an immune response in humans (and not in other organisms), as shown in Figure 11. We got a list of 631,679 peptides and used it as our reference database.


Figure 11 - IEDB online tool

In case that an epitope is found, the nucleotides that create the problematic sequence are defined as a new constraint, and a new switch sequence is generated.



Uridine Depletion

Following our consultation with Lonza Company, we learned about the immune response activation by uridine rich mRNA sequences [27], so we decided to add a novel uridine depletion algorithm to our pipeline.

Since we are not producing synthetic mRNAs, using modified uridine is not a viable option for reducing immune response activation. Another option is to switch uridine containing codons with synonymous codons without uridine. To this end, we gave a weight to each codon according to the number of uridines it includes (more uridines related to lower weight) and aimed at replacing codons with synonymous codes with higher weights. Thus, this part of our pipeline receives a sequence, and when possible, replaces uridine containing codons to the next best synonymous codon (according to typical decoding rate) which does not contain uridine.


Figure 12 - A simple illustration of replacing codons to synonymous ones for depletion of uridine from an mRNA sequence.







Assessment Metric


Following our conversation with team Aboa, we have implemented a toehold quality assessment metric for ranking our toeholds and improving our certainty, taken from a paper published by Green et al in 2018, in the synthetic biology journal [2]. This metric is based on calculations of probabilistic distance between the desired structure and the actual structure of certain elements, termed as “ensemble defects”. At first, pairwise binding probabilities of the input sequence are calculated. After that, those probabilities are used to calculate the distance between the desired structure and the predicted structure of the input sequence. A scoring function, consisting of these ensemble defects, was developed Based on an experimental data. This function is presented in Equation 3.

The parameters of this function were derived from Green’s previous research in which toehold sequences were tested in E. coli and their performance was assessed according to fluorescence intensity. In this paper, the switch general structure was divided into 4 areas of interest: toehold region, active sensor, min sensor and binding site, as shown in Figure 13 (taken from Green et al in 2018). The ensemble defects of these areas were calculated using NUPACK. When calculating defects, the desired structure has to be defined in order to be compared with the structure of the input sequence. The desired structure of those areas of interest are detailed below (see also Figure 13)

  • For the toehold region, which is defined as the sequence before the beginning of the stem, the desired structure is completely single-stranded, in order to be as free as possible for binding the trigger molecule. dtoehold denotes the defect related to this region,
  • For the active sensor region, which is defined as the sequence that starts from the nucleotide specified in figure X and spans until the end of the switch, the desired structure is completely single-stranded. dactive_sensor denotes the defect related to this region.
  • For the min sensor, which begins from the first nucleotide and ends in the last nucleotide of the stem, the desired structure is identical to the input structure of the complete switch, without the last open part (without the sequence after the stem). This structure contains the toehold region, the stem and the loop. dmin⁡_sensor denotes the defect related to this region.
  • For the binding site region, which is defined as the sequence that will bind to the trigger molecule (the red dots in Figure 13), the desired structure is completely single-stranded as well, for the same reason as the toehold region. dbinding_site denotes the defect related to this region.

Figure 13 - Division of switch general structure to areas of interest


The defects were then normalized by the same normalization method used in the paper: For every batch of switches that were generated for the same trigger, every different defect was divided by the maximal value of the same defect over all switches, making the maximal value possible for each defect 1. After that, the score of the switch was calculated using the function from equation 1, and the switches could be ranked. It is worth mentioning that lower score indicates better switch, due to the fact that defect levels represent the distance from the desired value, and the main assumption is that the input structure is ideal and deviations from it will reduce the switch efficiency.




Translation Efficiency Model


The mRNA part that undergoes translation is called “Open reading frame” (ORF). In the translation process, the ribosome “translates” codons to amino acids, according to the genetic code. The genetic code is the codon-amino acids pairs, and since we have 64 possible codons and only 20 amino acids, it is redundant. This redundancy led to codons being evolutionarily selected by translation machinery, and some codons are translated more efficiently than others and are usually more frequent. This concept is known as codon usage bias (CUB), and there are measures related to it that can be used to optimize translation efficiency. There are several optimization methods for choosing the right codons that we considered (see below). In addition, GC content should be taken into account when modifying the translated sequence as it is known to affect various gene expression aspects.

To optimize our sequences, we used the DNA chisel [24] optimization framework, which is an easy-to-use python library. It has two main steps – resolving constraints and optimizing according to objective. In our optimization function, we allow for GC content constraints, which is defaulted to be between 30% and 70%. In the future, other constraints could be added, such as pattern avoidance. For the optimization objectives, we allow for choosing between 3 optimization parameters that are defined below (Codon adaptation index, tRNA adaptation index and Typical Decoding Rate) and 3 DNA chisel supported codon optimization methods that are defined below (use best codon, match codon usage and relative codon adaptiveness).


Figure 14 - Example optimization framework workflow (taken from dnachisel's repository)


Optimization Parameters - Codon Usage Bias Indices:

Over the years, several codon usage bias indices have been developed, based on different characteristics. In our software, we allow translation optimization based on 3 different indices which are described below.

Codon Adaptation Index (CAI)

The codon adaptation index of a gene is a statistical approach that estimates the extent of its bias towards codons that are known to be favored in a reference set of genes. Usually, this set of genes is composed either of highly expressed genes or the whole genome. The chosen set of genes is then used to calculate the weights of each codon, also called the “relative adaptiveness”, as seen in the equation 4.

Xi - the number of appearances of the codon in the reference set.

xmax - the number of appearances of the most frequently used codon for an amino acid (from the synonymous codons).

This is equal to the RSCU ratio, which is defined as the observed appearance of the codon divided by the expected appearance under equal use of synonymous codons.

● In case a codon is not observable in the reference set, we assign its count to 0.5, so that its weight is set to:

These calculated weights can later be used to optimize translation efficiency. By switching codons with lower weights to codons with higher weights, the translation process is expected to be affected positively.

For our software, we calculated CAI weights for Homo sapiens, S.cerevisiae and E.coli, based on frequency data from Kazusa database [21]. Optimization for other organisms could be added in the future.


tRNA Adaptation Index (tAI)

Another index for codon usage bias is the tRNA adaptaion index. In contrast to the statistical codon adaptation index, tAI is biophysical measure, which assumes that highly expressed genes are adapted to the organisms tRNA pool (the supply of the tRNA molecules that match specific codons). Codons that their corresponding tRNA are readily available are considered more optimal, as the efficiency of their translation is not decreased due to the lack of the tRNA molecule.

To calculate the weights of each codon, an estimation of the tRNA pool of the organism is used, along with the consideration of tRNA-codon pairs effectiveness, including both Watson-Crick and wobble interactions.

As measuring tRNA levels is not a simple undertaking, tAI uses the number of copies of each tRNA in the genome, which is shown to be correlated with tRNA expression levels [22]. The copy number of tRNAs can be extracted from http://gtrnadb.ucsc.edu/index.html.


Figure 15 - Example of Homo sapiens tRNA copy number tables

The weights are then calculated for each codon according to equation 6:

Where i stands for each codon (i.e. the index of the codon), n is the number of tRNAs that recognize codon i and Sij values are a measure of effectiveness of the tRNA-codon interaction, fitted by experimental protein abundance data [23].

Finally, the weights are normalized according to the most optimal codon overall (and not only according to synonymous codons as with CAI), as in equation 7.

However, copy numbers and corresponding calculated weights are readily available in the supplementary materials of Tuller et.al. 2010 article [11], so we did not calculate them on our own. Nevertheless, we validated that the weights are in agreement with the data on the website (as biological data is regularly updated with results from new studies). For our software, we used tAI weights for Homo sapiens, S.cerevisiae and E.coli, but tAI weights for other organisms can be added in the future.


Typical Decoding Rate (TDR)

Typical decoding rate is an empirical measure which is calculated from experimental ribo-seq data. In ribo-seq experiments[29], we gather information on the presence of the ribosome on each codon (footprint count), where the more reads on a certain codon means that it takes the ribosome longer to translate it.


Figure 16 - Decoding time estimation, figure taken from Tuller et al. 2014 [11]

Translation of codons by ribosomes is stopped and exposed mRNAs are digested. Protected mRNA footprints are then sequenced and mapped to the genome, creating a footprint count (FC) profile for each gene.


For the calculation of TDR weights, codon footprint counts are taken from all mRNAs and are used to estimate the decoding time.

First, a normalized footprint count for each ORF is calculated according to equation 8.

Where FCi is the footprint count (or read count) for codon i and it is normalized by the average footprint count over all codons of the ORF (in the case of the equation above the ORF includes L codons). Then, we generate the distribution of normalized footprint counts (NFC) of each of the codons based on the values of the NFC of the codon over all its positions in all the mRNAs (see Figure 17). Finally an exponentially modified gaussian (EMG) distribution [28] is fitted each of the empirical distributions.


Figure 17 - Example of normalized footprint count (or normalized read count) distribution for two codons, taken from Tuller et al. 2014 [11].


The typical decoding rate of a codon is the parameter related to the mean of the Gaussian component of the EMG distribution fitted to its NFC distribution. . The reciprocal of the typical decoding time is the typical decoding rate which is used as the weight for each codon. All the details of this index appear in Tuller et al. 2014 [11].

Since we did not have the resources to perform ribo-seq experiments, we based our weights over existing typical decoding times taken from this database [11,12] which used ribosomes profiles from [13-19], for Homo sapiens, S.cerevisiae and E.coli, but weights for other organisms can be added in the future.


Codon Optimization Methods:

    The 3 methods supported by DNA chisl for codon optimization are:
  • "Use best codon": Every codon is replaced with the most frequent synonymous codon in the target organism, equivalent to the Codon Adaptation Index (CAI) optimization that we described above.
  • “Match codon usage”: the final sequence's codon usage will try to match the codon usage profile of the target species. This method is used throughout the literature, for instance Hale and Thomson 1998 [25].
  • “Harmonize RCA”: RCA – relative codon adaptiveness. Each codon is replaced by a synonymous codon, whose usage in the target organism matches the usage of the original codon in its host organism [26].






References


  1. Green, A. A., Silver, P. A., Collins, J. J., & Yin, P. (2014). Toehold switches: de-novo-designed regulators of gene expression. Cell, 159(4), 925-939
  2. Ma, D., Shen, L., Wu, K., Diehnelt, C. W., & Green, A. A. (2018). Low-cost detection of norovirus using paper-based cell-free systems and synbody-based viral enrichment. Synthetic Biology, 3(1), ysy018.
  3. Uyanık, G. K., & Güler, N. (2013). A study on multiple linear regression analysis. Procedia-Social and Behavioral Sciences, 106, 234-240.
  4. Zadeh, J. N., Steenberg, C. D., Bois, J. S., Wolfe, B. R., Pierce, M. B., Khan, A. R., ... & Pierce, N. A. (2011). NUPACK: Analysis and design of nucleic acid systems. Journal of computational chemistry, 32(1), 170-173.
  5. Lorenz, R., Bernhart, S. H., Höner zu Siederdissen, C., Tafer, H., Flamm, C., Stadler, P. F., & Hofacker, I. L. (2011). ViennaRNA Package 2.0. Algorithms for molecular biology, 6(1), 1-14.
  6. Mann, M., Wright, P. R., & Backofen, R. (2017). IntaRNA 2.0: enhanced and customizable prediction of RNA–RNA interactions. Nucleic acids research, 45(W1), W435-W439.
  7. Brierley, Ian, Simon Pennell, and Robert JC Gilbert. "Viral RNA pseudoknots: versatile motifs in gene expression and replication." Nature Reviews Microbiology 5.8 (2007): 598-610.
  8. Legendre, Audrey, Eric Angel, and Fariza Tahi. "Bi-objective integer programming for RNA secondary structure prediction with pseudoknots." BMC bioinformatics 19.1 (2018): 1-15.
  9. Kim, Yohan, et al. "Immune epitope database analysis resource." Nucleic acids research 40.W1 (2012): W525-W530.
  10. Leach, L. F., & Henson, R. K. (2007). The use and impact of adjusted R2 effects in published regression research. Multiple Linear Regression Viewpoints, 33(1), 1-11.
  11. Dana A. and Tuller T. The effect of tRNA levels on decoding times of mRNA codons. Nucleic Acids Res. 2014
  12. Dana A. and Tuller T. Mean of the typical decoding rates: a new translation efficiency index based on ribosome analysis data. 3G. 2014
  13. Li, G.W., Oh, E. & Weissman, J.S. The anti-Shine-Dalgarno sequence drives translational pausing and codon choice in bacteria. Nature 484, 538-41 (2012).
  14. Stadler, M., Artiles, K., Pak, J. & Fire, A. Contributions of mRNA abundance, ribosome loading, and post- or peri-translational effects to temporal repression of C. elegans heterochronic miRNA targets. Genome Res (2012).
  15. Ingolia, N.T., Lareau, L.F. & Weissman, J.S. Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell 147, 789-802 (2011).
  16. Guo, H., Ingolia, N.T., Weissman, J.S. & Bartel, D.P. Mammalian microRNAs predominantly act to decrease target mRNA levels. Nature 466, 835-40 (2010).
  17. Lee, S., Liu, B., Huang, S.X., Shen, B. & Qian, S.B. Global mapping of translation initiation sites in mammalian cells at single-nucleotide resolution. Proc Natl Acad Sci U S A 109, E2424-32 (2012).
  18. Ingolia, N.T., Ghaemmaghami, S., Newman, J.R. & Weissman, J.S. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 324, 218-23 (2009).
  19. Brar, G.A. et al. High-resolution view of the yeast meiotic program revealed by ribosome profiling. Science 335, 552-7 (2012).
  20. Lee S, Weon S, Lee S, Kang C. Relative codon adaptation index, a sensitive measure of codon usage bias. Evol Bioinform Online. 2010 May 5;6:47-55. doi: 10.4137/ebo.s4608. PMID: 20535230; PMCID: PMC2880845.
  21. Nakamura, Y., Gojobori, T., & Ikemura, T. (2000). Codon usage tabulated from international DNA sequence databases: status for the year 2000. Nucleic acids research, 28(1), 292-292.
  22. Kimberly A. Dittmar, Evelyn M. Mobley, Agnes Jancso Radek, Tao Pan,Exploring the Regulation of tRNA Distribution on the Genomic Scale, Journal of Molecular Biology, Volume 337, Issue 1, 2004, Pages 31-47, ISSN 0022-2836
  23. Mario dos Reis, Renos Savva, Lorenz Wernisch, Solving the riddle of codon usage preferences: a test for translational selection, Nucleic Acids Research, Volume 32, Issue 17, 1 September 2004, Pages 5036–5044
  24. Zulkower, V., & Rosser, S. (2019). DNA Chisel, a versatile sequence optimizer. bioRxiv.
  25. Hale RS, Thompson G. Codon optimization of the gene encoding a domain from human type 1 neurofibromin protein results in a threefold improvement in expression level in Escherichia coli. Protein Expr Purif. 1998 Mar;12(2):185-8. doi: 10.1006/prep.1997.0825. PMID: 9518459.
  26. Claassens NJ, Siliakus MF, Spaans SK, Creutzburg SCA, Nijsse B, Schaap PJ, et al. (2017) Improving heterologous membrane protein production in Escherichia coli by combining transcriptional tuning and codon usage algorithms. PLoS ONE 12(9): e0184355. https://doi.org/10.1371/journal.pone.0184355
  27. Vaidyanathan S, Azizian KT, Haque AKMA, Henderson JM, Hendel A, Shore S, Antony JS, Hogrefe RI, Kormann MSD, Porteus MH, McCaffrey AP. Uridine Depletion and Chemical Modification Increase Cas9 mRNA Activity and Reduce Immunogenicity without HPLC Purification. Mol Ther Nucleic Acids. 2018 Sep 7;12:530-542. doi: 10.1016/j.omtn.2018.06.010. Epub 2018 Jun 30. PMID: 30195789; PMCID: PMC6076213.
  28. Eli. Grushka Analytical Chemistry 1972 44 (11), 1733-1738
  29. Ingolia NT. Ribosome profiling: new views of translation, from single codons to genome scale. Nat Rev Genet. 2014 Mar;15(3):205-13. doi: 10.1038/nrg3645. Epub 2014 Jan 28. PMID: 24468696.
  30. Keith Pardee, Alexander A. Green, Melissa K. Takahashi, Dana Braff, Guillaume Lambert, Jeong Wook Lee, Tom Ferrante, Duo Ma, Nina Donghia, Melina Fan, Nichole M. Daringer, Irene Bosch, Dawn M. Dudley, David H. O’Connor, Lee Gehrke, James J. Collins, Rapid, Low-Cost Detection of Zika Virus Using Programmable Biomolecular Components, Cell, Volume 165, Issue 5, 2016, Pages 1255-1266, ISSN 0092-8674
  31. J. Shine, L. Dalgarno, Occurrence of heat-dissociable ribosomal RNA in insects: The presence of three polynucleotide chains in 26 S RNA from cultured Aedes aegypti cells, Journal of Molecular Biology, Volume 75, Issue 1, 1973, Pages 57-72, ISSN 0022-2836
  32. Xu, L., Liu, P., Dai, Z. et al. Fine-tuning the expression of pathway gene in yeast using a regulatory library formed by fusing a synthetic minimal promoter with different Kozak variants. Microb Cell Fact 20, 148 (2021).
  33. McClements ME, Butt A, Piotter E, Peddle CF, MacLaren RE. An analysis of the Kozak consensus in retinal genes and its relevance to gene therapy. Mol Vis. 2021 May 8;27:233-242. PMID: 34012226; PMCID: PMC8116250.
  34. OncoDB: an interactive online database for analysis of gene expression and viral infection in cancer Gongyu Tang, Minsu Cho, Xiaowei Wang
  35. Composite effects of gene determinants on the translation speed and density of ribosomes T Tuller, I Veksler-Lublinsky, N Gazit, M Kupiec, E Ruppin, M Ziv-Ukelson Genome biology 12 (11), 1-18