The vision of our team is to create a generic pipeline architecture that generates Toehold sequences for various cases. We wished to aggregate and develop sophisticated algorithms from different fields to solve this complex problem. The final architecture is complex and hard to follow. Thus we break it down into challenges and how to solve each one.
First challenge: finding the mRNA that is most specific to the target cell (for a detailed explanation, visit our description page). Using statistical methods to solve this challenge is the best approach, thanks to different big data sources.
Second challenge: finding the best trigger sequence. In order to solve this challenge, we developed an algorithm that searches for the best window, within the chosen mRNA sequence, for serving as our trigger sequence. Our algorithm is looking for different metrics that characterize good triggers like local folding energy, open structure, etc`
Third challenge: designing the best Toehold sequence for a given trigger. In order to solve this challenge, we combined different algorithms like NUPACK[4], RNAPKplex[5], IntaRNA[5]. Those algorithms are widely used in the scientific community and have an excellent reputation. On top of this, we also developed our new algorithms to solve sub-challenges, which were not addressed by those algorithms - Finding epitopes, using a different structure for the Toehold.
Fourth challenge: optimizing the translation efficiency model by using several methods to choose the right codons, such as Codon adaptation index(CAI), tRNA adaptation index (tAI), and Typical coding rate (TDR)
On top of those challenges, we generated a computational model (a linear regressor) based on experimental data for predicting the toehold performance in the presence of the trigger molecule and without it. We used the regressor to reveal the influence of different features on the toehold performance and to highlight insights about its design. We also utilized a data-based metric for toehold switch quality assessment. This metric, which was taken from previous work [2], is based on calculations of structural defects that will be elaborated later.
We developed a novel linear regressor, based on experimental data in prokaryotes, for ranking toehold switch performance. The experimental data we used contains 168 toehold switch sequences that were designed by Green’s lab and tested in E. coli.[1]. The small amount of the data limited us in the accuracy of the predictions and therefore, we decided to use the data to reveal new insight on the influence of different features of the Toehold switch sequence on its performance. The features we created included different energy calculations and structural parameters of the toehold. Finding the correlated features helped us to better understand the design of the switch and improved our models. In addition, using the regressor we were able to rank a list of toehold switch sequences by their performance. The ranking contributed to assess the different toehold designs and therefore to improve our computational toehold switch design even before testing it in the wet lab.
Linear regression is a statistics linear approach that attempts to model the relationship between a set of variables by fitting a linear equation to observed data[3].
The linear regression equation:
X in the equation is considered to be a vector of explanatory variables, and Y in the equation is considered to be a dependent variable. The relationship between the variables described by m in the equation (a vector or coefficient). The additional B parameter (a constant) captures all other factors which influence the dependent variable Y other than the variable X.
Linear regression analysis is used to predict the value of the dependent variable. In our model the prediction is for the protein abundance levels in the targeted cells and the levels in the non-targeted cells while the explanatory variables are the set of features we calculate based on the toehold switch sequence.
In order to find the relationship between the variables we trained the model with the 168 toehold switch sequences and their experimental results we received from Green’s lab. After training the model, we obtained the m and B vectors of the equation.
Our regressor predicts 3 parameters:
Most studies focused on the ON/OFF ratio as the objective function. Important step towards achieving better toehold performance was to redefine the objective function. When considering only the ratio, bad results could appear as good results. High ON/OFF ratio can resemble good results but in practice to be composed of low ON levels divided by lower OFF levels, which indeed end without effect on cells without the trigger but also without effect on the cells with the trigger. To consider results as satisfied we want to achieve high ON levels, high enough to obtain impact on the targeted cells, and low OFF levels to prevent impact on the non-targeted cells. Thus, we understood that the ratio by itself is not giving us enough information, and therefore we should also examine the absolute expression levels of ON and OFF separately. In this way we can achieve more insights regarding our results.
The results published in Green’s article contain only the ON/OFF ratio. Thus regarding our new conclusion, we asked them to send us the detailed results of the experiments, including ON and OFF levels separately. They kindly send us the full results of 168 toehold switches and those served us for developing our novel predictor.
Toehold switch sequence featuresThe 168 toehold switch sequences designed by Green et al shared similar structure, contains the following parameters:
Figure 1 Illustration of the toehold switch structure
At the beginning of the toehold it has a trigger binding site, a complementary part to the trigger molecule sequence; after it, a stem is created. The first part of the stem is a continuation of the complementary part to the trigger. The stem is created thanks to a complementary part in the second half of the toehold. At the middle of the stem there is a small bulb created from the presence of ‘AUG’ on the other side. At the top of the stem, a hairpin is created and contains the RBS. Following the stem there is the gene of interest, the gene we aim to express in the targeted cells.
We created 27 features for the model, including energy calculations and GC content of different toehold structure components, structural parameters, interaction energy calculations and more. Attached here is the full list of features:
Some of the features have been taken from Green et al work including the MFE calculation of the different structure components (e.g linker, stem etc.) and the identity of the nucleotide before the hairpin.
The defect calculations were inspired by another Green’s lab work by Ma et al, where they presented a method to rank toehold switches based on their previous experimental data.[2] The metric includes a combination of defect calculation, which is a probabilistic distance between the desired structure and the structure obtained in practice. The metric is described in detail in a different section on this page (Assessment Metric). We used the defect calculation as features for the model.
In the article of Green et al, they do not infer a complex model based on a combination of features, instead they report the best feature, which was the MFE calculation from RBS to end. Thus, we used this feature as a reference to our results. Any increase above it, considered for us as an improvement.
Features correlation to the labelsIn order to find the best set of features for the model without overfitting, we performed features selection using adjusted correlation calculation as the optimization criteria.
First, we chose the best feature which had the highest correlation to the label.
Attached here a pearson correlation heatmap showing the correlation between the features to each of the labels. It can be seen from the heatmap that there are two features with the highest correlation to label ON. One positive is the stem MFE and one negative which is the switch-switch interaction. For label OFF, the most correlated feature is the stem MFE. However for the label ON/OFF, the correlations are lower but the highest one is the MFE of the Switch with the GFP sequence.
Figure 2 Correlation heatmap of the features and the labels
The heatmap described the Pearson correlation between the 27 features of the model (y-axis) to the model’s labels (x-axis). The sign of the correlation indicates the relationship between the feature and the label. Positive correlation means that with the increase of the feature’s value, the label’s value is increased; negative correlation means that with the increase of the feature, the label value is reduced.
The heatmap results agree with the literature and our assumptions. First, as can be seen, the ON/OFF label seems to be less predictive than the ON and OFF separately. In addition, the most correlated features emphasize that the stem part of the toehold has the most influence on the toehold performance and should be a significant factor in designing the toehold.
Features selectionNext, in each iteration we added one feature from the list and calculated the adjusted correlation of the chosen features. The adjusted correlation equation is [10]:
where p is the total number of explanatory variable in the model, and n is the sample size
Then, we chose the feature that combined with the best feature, yields the highest adjusted correlation. We repeated this part, adding more features to the model until we received no additional improvement in the adjusted correlation. The result features that obtained for each objective appear in Table 1.
Table 1 - Best features selected for each label (along with their corresponding correlation coefficients).
Table 2 - The adjusted correlation of our best set features vs Green’s best feature
From the results it can be seen that we succeed in improving the adjusted correlation using the set of features selected by our model. It means that we improved the ability to predict the ON levels, OFF levels and ON/OFF ratio.
The regression coefficients of the three models appear in the lists below (as well as in Table 1):
ON label coeffients = [-36796.8123923 , 17799.98343884, -18296.37299861, -21000.42057797, 28222.39719668, 39041.79498465, 27075.60791862, -29849.9025678 , -4976.18341367, -4483.06546978, 5311.63763123, 19619.34307479, -20438.73825092, 4253.20887393]
OFF label coeffients= [-4513.77098889, 1894.95663112, 1338.77362406, -1892.02861732, 1909.80579024, -581.19563308, 712.72225314, -1846.70264986, 1954.02997273, -709.01412254, 964.28700903 ]
ON_OFF coeffients = [ 82.07104428, -45.52451068, 55.90282051, -19.72310897, -34.506619 , -31.93795545, -18.91773184, 16.54207188]
Our trigger selection algorithm is composed of 2 stages: mRNA selection and window selection within the mRNA molecule.
DefinitionsTrigger is a consecutive section of nucleotides of the mRNA which are predicted to hybridize with our toehold switch. We will refer to it as the trigger window in the mRNA. The trigger has a certain length which is a parameter set by the user. For a single mRNA molecule, different triggers with a certain length vary by their start offset of the window section which is a partial sequence of the mRNA.
The main data source for choosing mRNA are databases of RNA sequencing experiments, whether single cell or not.
The main data source for choosing mRNA are databases of RNA sequencing (RNA-seq) experiments, whether single cell or not. Based on the RNA-seq data, we define target cells (e.g. certain types of cancer, certain tissue, or any other differential attribute) which were monitored in the RNA-seq experiments. Such expression level information can be used to find a relevant trigger. To find mRNA differentially expressed in one type of cells vs. other type we can use statistical approach for multiple adjusted p values calculation.
Since our demand is to find a window in an mRNA that will be a good trigger, we should consider both the expression levels of the mRNA and the local folding of that window in the mRNA.
In each stage scores are generated, score for mRNA and score for a certain window in the mRNA which will serve as a trigger. Finally, the scores of the 2 stages are combined and the best candidates are chosen according to the total score.
Based on the DEG (differentially expressed genes) data we first filter out those genes with p-value above a specified threshold.
Both scores are combined by multiplication.
An example of mRNA scoring for BLCA cancer type is presented at table 3 (differential expression genes data taken from oncodb.org[34])
Table 3 - mRNA scoring method of BLCA cancer type
We use algorithms for prediction for each mRNA the mRNA secondary structure, folding free energy and nucleotides base-pairing probability matrix based on the ensemble of secondary structures. Those algorithms are probabilistic, and the accuracy gets decreased as the predicted sequence length gets longer. Based on biophysical models of the ribosome movement during translation[35], it is suggested that the ribosome unfold the mRNA during the translation thus local mRNA (insub-regions of the mRNA) folding is the more relevant measure of the accessibility of a sub window in the mRNA (as illustrated in figure 3).
Figure 3 - Ribosome unfolds mRNA during translation
Thus, we are computing averaged hybridization predictions metrics based on aggregation of two levels of moving windows on the mRNA sequence. We use an averaging window approach for gaining robustness by limiting the influence of single prediction related to one subsequence. The approach is explained below:
The length of that window is another parameter that may be chosen by the user and represents the length of the hybridization site size between the triggering mRNA and our designed toehold switch.
Thus, each trigger (the secondary window which is the inner window) appears a few times in a few different outer windows. We finally calculate for each trigger an average of the calculated metrics defined below aggregated from each of its occurrences in the different outer windows.
Figure 4 - An illustration of the two sliding windows in our algorithm for finding an efficient trigger.
In the following figures (figure 4 and Figure 6) we give an example of the trigger score for a certain gene(TK1) with certain trigger size windows(51 nucleotides) calculated by the 2 metrics defined above.
Example of averaged partial folding free energies metric (gene TK1 1269 nucleotides long) is described in figure 4.
Figure 4 - An example of averaged partial folding free energies metric (gene TK1 1269 nucleotides long)
The x axis is the number of nucleotides offset from the beginning of the gene ORF (open reading frame). The trigger window starts at that offset with a length of trigger size. The y axis is the calculated score. Higher scores are expected to be associated with positions that have unfolded secondary structure and thus can be used as triggers. We can see the 6 highest predicted trigger scores for TK1 gene with subsequence trigger window size of 51 nucleotides are in offsets (peaks marked with red circles on figure) ~160, ~180, ~1170, ~1100, ~950, ~820 (the algorithm avoids returning very close offsets to already chosen ones and will avoid returning candidates offsets close by less than about 10 nucleotides) Those marked offsets will be chosen as best candidates trigger offsets for that gene by the algorithm.
An example of averaged folding probability metric (gene TK1 1269 nucleotides long) is presented in Figure 6.
Figure 6 - An example of averaged folding probability metric (gene TK1 1269 nucleotides long)
The x axis is the number of nucleotides offset from the beginning of the gene ORF (open reading frame). The trigger window starts at that offset with a length of trigger size. The y axis is the calculated score. Higher scores are expected to be associated with positions that have unfolded secondary structure and thus can be used as triggers. We can see the 5 highest predicted trigger scores for TK1 gene with subsequence trigger window size of 51 nucleotides are in offsets (peaks marked with red circles on figure) ~975, ~1180, ~820, ~430, ~280 (the algorithm avoids returning very close offsets to already chosen ones and will avoid returning candidates offsets close by less than about 10 nucleotides) Those marked offsets will be chosen as best candidates trigger offsets for that gene by the algorithm.
By multiplying the mRNA score with the trigger score, we generate a ranking of n best candidates for toehold switch generation.
Continuing the use case example in table 3 above (i.e. choosing best mRNA for BLCA cancer type) we generate n best windows based on the combined score that considers both the mRNA expression levels and the folding of the window within it. The list of n best candidates in this case is presented in table 4 for n=10.
Table 4 - Ten Best trigger window score in mRNA molecules expressed in BLCA cancer type.
To broaden our trigger generation space, we consulted with Dr. Yoram Zarai from OncoDechipher and recieved a large data set comprised of over 8300 cancer patients suffering from various types of cancer. We analyzed this data set, which contains over 750 genes, with the intention to find mutations (or group of mutations) that are meaniningful enough for the creation of a trigger that will responde to mutated mRNA molecules but not to the healthy ones. Read more about this subject in our computational proof of concept
Figure 7 - An overview of our cancer-related mutations database, which contains more than 8300 patients and 750 mutated genes.
In order to create a sequence that utilizes both attributes, a sophisticated algorithm is needed. Starting a new algorithm from scratch will be irrelevant, as the scientific community has been dealing with this problem for years. After a literature scan, we came across NUPACK - design algorithm.
NUPACK enables the users to declare a target structure and different kinds of constraints. The algorithm considers all the parameters and generates a sequence as close to the target structure as possible. In addition to structure optimization, NUPACK can consider the interaction between several sequences, thus making it perfect for problems like generating Toehold sequences. As strong as NUPACK is, it cannot consider the much bigger picture. Trigger selection, initiation and translation factors and complex structure like pseudoknots are all examples of issues that are not addressed by NUPACK and are critical to the toehold performance. That's where we come into place, trying to close the gap between the NUPACK narrow view on the much more complex problem.
NUPACK algorithm, as mentioned before, is widely used. Therefore, its input parameters are not designed only for the Toehold sequencing problem. It supports a much wider variety of issues, thus leading to a much richer and more generic interface. In order to adjust the algorithm to suit our goals and needs, we used the input narrowly and more precisely.
One significant example is the use of domains:
NUPACK Domains enable the user to separate an area from any sequence and classify it by name and purpose. Naturally, we divided our structure into the domains presented in Figure 8.
Figure 8 - separation to domains of Toehold structure
The purpose of this division is to classify each part of the sequence with its biological meaning.
We also utilize a dynamic separation into domains in a more sophisticated manner while using our pseudoknots avoidance algorithm, as detailed in the next section.
Our model takes into account the organism of interest. It uses different structures for prokaryotes (Pardee's "Series B structue[30]) and eukaryotes (a structure selected by us after some validation experiments). In addition, organism-compatible ribosome binding site is selected - Shine-Dalgarno sequence [31] for prokaryotes and Kozak sequences for Eukaryores (while considering the differences in kozak sequences between yeast and Homo sapiens [32,33]).
Pseudoknots are formed upon base pairing of a single-stranded region of RNA in the loop of a hairpin to a stretch of complementary nucleotides elsewhere in the RNA chain [7]. Pseudoknot types are presented in Figure 9. Their occurrence can harm switch efficiency by creating a locked structure and preventing translation even in targeted cells or creating a different structure than planned.
Figure 9 - Pseudoknot types [8]
NUPACK documentation address this issue:
"In NUPACK 4, pseudoknots are excluded from the structural ensemble."In our algorithm, we are using the RNAPKplex package by Vienna RNA [5]. This package is at present the only component of the Vienna RNA package that predicts pseudoknotted RNA structure. RNAPKplex is a variant of a basic component of this package, RNAup, which computes RNA-RNA interactions by first computing the local opening energies of two RNA molecules and then computing interaction energies between those molecules. Unlike RNAup, RNAPKplex computes the interactions between positions within the same RNA molecule.
In our pipeline, right after generating the sequence, pseudoknot's presence is checked. If a pseudoknot is found, we use dynamic separation for domains to redesign the sequence in the correct position. At first, the sequence is referred to as a straight line and when a pseudoknot is found, the segment containing it gets separated from the line. This segment is defined as a new separate domain, a thing that allows us to apply constraints on it exclusively. It is important to note that this separation does not harm the whole structure because NUPACK calculates the structure using the combination of all the domains together, and not separately.
The nucleotides that form the pseudoknotted structure are defined as a new constraint that will be applied on the new defined domain, and a new sequence is designed.
When aiming for treatment in humans, an immune response should always be taken into account. For that reason, we implemented an epitope prevention algorithm. Immune response can be caused by epitopes. An epitope is a peptide, typically consisting of around 6 amino-acids, that is recognized by receptors found on B-cells as a foreign (non-self) protein. The receptors bind to the epitope by its structure and recruit the immune system in order to treat the menace.
As part of our design, some amino acids are added to the protein of interest in its 5` side, due to the fact that the translation begins before the start of the sequence of the protein (as shown in Figure 10).
Figure 10 - Visualization of added amino acids
In order to deal with this problem, we implemented an epitope prevention algorithm. In this algorithm, we first check which amino acids are translated from the nucleotides adjacent and downstream of the AUG (see Figure 10). After that, we check whether this amino-acid sequence contains a known epitope, by looking in a vast database of known epitopes taken from the literature [9].
We used the online tool IEDB by NIAID [9], and restricted our search to epitopes that are known to induce an immune response in humans (and not in other organisms), as shown in Figure 11. We got a list of 631,679 peptides and used it as our reference database.
Figure 11 - IEDB online tool
In case that an epitope is found, the nucleotides that create the problematic sequence are defined as a new constraint, and a new switch sequence is generated.
Following our consultation with Lonza Company, we learned about the immune response activation by uridine rich mRNA sequences [27], so we decided to add a novel uridine depletion algorithm to our pipeline.
Since we are not producing synthetic mRNAs, using modified uridine is not a viable option for reducing immune response activation. Another option is to switch uridine containing codons with synonymous codons without uridine. To this end, we gave a weight to each codon according to the number of uridines it includes (more uridines related to lower weight) and aimed at replacing codons with synonymous codes with higher weights. Thus, this part of our pipeline receives a sequence, and when possible, replaces uridine containing codons to the next best synonymous codon (according to typical decoding rate) which does not contain uridine.
Figure 12 - A simple illustration of replacing codons to synonymous ones for depletion of uridine from an mRNA sequence.
Following our conversation with team Aboa, we have implemented a toehold quality assessment metric for ranking our toeholds and improving our certainty, taken from a paper published by Green et al in 2018, in the synthetic biology journal [2]. This metric is based on calculations of probabilistic distance between the desired structure and the actual structure of certain elements, termed as “ensemble defects”. At first, pairwise binding probabilities of the input sequence are calculated. After that, those probabilities are used to calculate the distance between the desired structure and the predicted structure of the input sequence. A scoring function, consisting of these ensemble defects, was developed Based on an experimental data. This function is presented in Equation 3.
The parameters of this function were derived from Green’s previous research in which toehold sequences were tested in E. coli and their performance was assessed according to fluorescence intensity. In this paper, the switch general structure was divided into 4 areas of interest: toehold region, active sensor, min sensor and binding site, as shown in Figure 13 (taken from Green et al in 2018). The ensemble defects of these areas were calculated using NUPACK. When calculating defects, the desired structure has to be defined in order to be compared with the structure of the input sequence. The desired structure of those areas of interest are detailed below (see also Figure 13)
Figure 13 - Division of switch general structure to areas of interest
The defects were then normalized by the same normalization method used in the paper: For every batch of switches that were generated for the same trigger, every different defect was divided by the maximal value of the same defect over all switches, making the maximal value possible for each defect 1. After that, the score of the switch was calculated using the function from equation 1, and the switches could be ranked. It is worth mentioning that lower score indicates better switch, due to the fact that defect levels represent the distance from the desired value, and the main assumption is that the input structure is ideal and deviations from it will reduce the switch efficiency.
The mRNA part that undergoes translation is called “Open reading frame” (ORF). In the translation process, the ribosome “translates” codons to amino acids, according to the genetic code. The genetic code is the codon-amino acids pairs, and since we have 64 possible codons and only 20 amino acids, it is redundant. This redundancy led to codons being evolutionarily selected by translation machinery, and some codons are translated more efficiently than others and are usually more frequent. This concept is known as codon usage bias (CUB), and there are measures related to it that can be used to optimize translation efficiency. There are several optimization methods for choosing the right codons that we considered (see below). In addition, GC content should be taken into account when modifying the translated sequence as it is known to affect various gene expression aspects.
To optimize our sequences, we used the DNA chisel [24] optimization framework, which is an easy-to-use python library. It has two main steps – resolving constraints and optimizing according to objective. In our optimization function, we allow for GC content constraints, which is defaulted to be between 30% and 70%. In the future, other constraints could be added, such as pattern avoidance. For the optimization objectives, we allow for choosing between 3 optimization parameters that are defined below (Codon adaptation index, tRNA adaptation index and Typical Decoding Rate) and 3 DNA chisel supported codon optimization methods that are defined below (use best codon, match codon usage and relative codon adaptiveness).
Over the years, several codon usage bias indices have been developed, based on different characteristics. In our software, we allow translation optimization based on 3 different indices which are described below.
The codon adaptation index of a gene is a statistical approach that estimates the extent of its bias towards codons that are known to be favored in a reference set of genes. Usually, this set of genes is composed either of highly expressed genes or the whole genome. The chosen set of genes is then used to calculate the weights of each codon, also called the “relative adaptiveness”, as seen in the equation 4.
Xi - the number of appearances of the codon in the reference set.
xmax - the number of appearances of the most frequently used codon for an amino acid (from the synonymous codons).
This is equal to the RSCU ratio, which is defined as the observed appearance of the codon divided by the expected appearance under equal use of synonymous codons.
● In case a codon is not observable in the reference set, we assign its count to 0.5, so that its weight is set to:
These calculated weights can later be used to optimize translation efficiency. By switching codons with lower weights to codons with higher weights, the translation process is expected to be affected positively.
For our software, we calculated CAI weights for Homo sapiens, S.cerevisiae and E.coli, based on frequency data from Kazusa database [21]. Optimization for other organisms could be added in the future.
Another index for codon usage bias is the tRNA adaptaion index. In contrast to the statistical codon adaptation index, tAI is biophysical measure, which assumes that highly expressed genes are adapted to the organisms tRNA pool (the supply of the tRNA molecules that match specific codons). Codons that their corresponding tRNA are readily available are considered more optimal, as the efficiency of their translation is not decreased due to the lack of the tRNA molecule.
To calculate the weights of each codon, an estimation of the tRNA pool of the organism is used, along with the consideration of tRNA-codon pairs effectiveness, including both Watson-Crick and wobble interactions.
As measuring tRNA levels is not a simple undertaking, tAI uses the number of copies of each tRNA in the genome, which is shown to be correlated with tRNA expression levels [22]. The copy number of tRNAs can be extracted from http://gtrnadb.ucsc.edu/index.html.
Figure 15 - Example of Homo sapiens tRNA copy number tables
The weights are then calculated for each codon according to equation 6:
Where i stands for each codon (i.e. the index of the codon), n is the number of tRNAs that recognize codon i and Sij values are a measure of effectiveness of the tRNA-codon interaction, fitted by experimental protein abundance data [23].
Finally, the weights are normalized according to the most optimal codon overall (and not only according to synonymous codons as with CAI), as in equation 7.
However, copy numbers and corresponding calculated weights are readily available in the supplementary materials of Tuller et.al. 2010 article [11], so we did not calculate them on our own. Nevertheless, we validated that the weights are in agreement with the data on the website (as biological data is regularly updated with results from new studies). For our software, we used tAI weights for Homo sapiens, S.cerevisiae and E.coli, but tAI weights for other organisms can be added in the future.
Typical decoding rate is an empirical measure which is calculated from experimental ribo-seq data. In ribo-seq experiments[29], we gather information on the presence of the ribosome on each codon (footprint count), where the more reads on a certain codon means that it takes the ribosome longer to translate it.
Figure 16 - Decoding time estimation, figure taken from Tuller et al. 2014 [11]
Translation of codons by ribosomes is stopped and exposed mRNAs are digested. Protected mRNA footprints are then sequenced and mapped to the genome, creating a footprint count (FC) profile for each gene.
For the calculation of TDR weights, codon footprint counts are taken from all mRNAs and are used to estimate the decoding time.
First, a normalized footprint count for each ORF is calculated according to equation 8.
Where FCi is the footprint count (or read count) for codon i and it is normalized by the average footprint count over all codons of the ORF (in the case of the equation above the ORF includes L codons). Then, we generate the distribution of normalized footprint counts (NFC) of each of the codons based on the values of the NFC of the codon over all its positions in all the mRNAs (see Figure 17). Finally an exponentially modified gaussian (EMG) distribution [28] is fitted each of the empirical distributions.
Figure 17 - Example of normalized footprint count (or normalized read count) distribution for two codons, taken from Tuller et al. 2014 [11].
The typical decoding rate of a codon is the parameter related to the mean of the Gaussian component of the EMG distribution fitted to its NFC distribution. . The reciprocal of the typical decoding time is the typical decoding rate which is used as the weight for each codon. All the details of this index appear in Tuller et al. 2014 [11].
Since we did not have the resources to perform ribo-seq experiments, we based our weights over existing typical decoding times taken from this database [11,12] which used ribosomes profiles from [13-19], for Homo sapiens, S.cerevisiae and E.coli, but weights for other organisms can be added in the future.