Initial considerations and model justification
Since the process for designing a sRNA sequence for regulating the expression of a desired gene not takes into account the thermodynamics of RNA folding and the behavior of a given sRNA sequence cannot be fully predicted according to literature (see software), our team decided to incorporate to our software three models:
- One for scoring the likelihood of a given sequence to become a “true sRNA” adapted from the work of Kumar and collaborators (2021) [1] who created a score for filtering bacterial RNA sequences in order to distinguish between true and false sRNAs
- The second one takes into account the hybridization efficiency between the designed sRNA and the corresponding mRNA target. This was adapted from the paper of Vazquez-Anderson and collaborators [5] who estimated the correlation between the hybridization efficiency of an asRNA (antisense RNA) and its target (considering cell conditions) and the free energy of sRNA and mRNA self-folding on the desired hybridization site with an accessibility factor that measures the probability of the target mRNA region of being unpaired
- A neural network-based model for predicting the behavior of the created sRNAs: upregulation or downregulation. This one was trained with a database filled by our team with reported sRNA:mRNA pairs
The models 1 and 2 are used for sRNA scoring, and all of the sRNA parts that our team created for this project and then registered on the parts register were created using this scoring criteria (see software) because the incorporation of or third model was done after we ordered the DNA synthesis
Termodinamical profiling and RNA conformational model
Model I
(Kumar et al., 2021) [1] developed an online platform for predicting possible sRNA sequences and its respective targets, giving a nucleotide sequence and a bacterial genome as an input. As part of PresRAT (the developed webserver) the authors developed a scoring formula for profiling the likelihood of a given sequence to be a true sRNA by analyzing its secondary structures, free energy profiles, sequence length and nucleotide sequence. This scoring model is described on equation (1)
Equation 1. Scoring model proposed by Kumar et al. (2021)
According to the authors, this scoring formula can be used for filtering true sRNA sequences from false ones. It is important to note that each parameter is weighted equally in comparison with each other.
Sequence score
The Sequence score is calculated based on the analysis of the entire sequence in fragments of two nucleotides over a 2-nucleotide sliding window in 5’ to 3’ direction. Depending upon the dinucleotide combination, a determined punctuation is given based on a 4x4 matrix, then all the punctuations are added together and divided by the length of the sRNA sequence.
The RNA sequence is analyzed over a 2-nucleotide sliding window according to the matrix on figure (1) and following the sum criteria on figure (2)
Figure 2. Crieria for determining dinucleotide punctuation
Once the individual punctuations are obtained, the final Sequence score is calculated according to the equation (2)
Equation 2. Sequence score calculation
Figure 1. 4x4 matrix for calculating dinucleotide punctuation
Equation 3. Uracil-load calculation on sRNA sequence
Uracil-load score
The U score only consideres the uracil load on the sRNA sequence and is calculated according to formula (3)
RNA free energy
The free energy of an RNA molecule is affected by: the number, composition and arrangement of nucleotides in its sequence. Longer sequences are more stable because they can form more pairings (Trotta, 2014) [2]. The Gibbs free energy [G] combines enthalpy and entropy into a single value. The change in free energy (∆G) is equal to the sum of the enthalpy plus the product of the temperature and entropy of the system. ∆G can predict the direction of a chemical reaction under constant conditions of T and P.
If ∆G > 0, the reaction is not spontaneus
If ∆G < 0, the reaction is spontaneus
RNA folding landscape and local minima
The unfolded RNA can take different conformations, and the probabilities of forming those structures are described as free energy landscapes depending on the energies of each conformation. In a RNA energy landscape, the formed structures become more and more energetically favorable until achieving the native structure (Woodson, 2010) [3]. The most stable conformation is associated with the minimum free energy of a given RNA sequence (Kucharík et al., 2013) [4].
Figure 4. Schematic representation of a RNA energy landscape. Local minima are represented with numbers 1-3, while the global minimum (MFE: Minimum free energy) is labeled with ⨳. Adapted from Kucharik et al. (2013) [4]
As summarized by (Kumar et al., 2021) [1], the energy landscape of a RNA molecule can be described as a 2D graph with the free energy of all possible RNA conformations versus the structure of such conformations; considering the previously mentioned, a local minima RNA structure is a secondary RNA conformation with a lower free energy than the previous and following one on the conformational landscape, as noted in the figure (4)
It is important to note that whereas there are a few local minima on an energy landscape, each one with an associated free energy, the global minimum is the most stable conformation, and its associated energy is denoted as the minimum free energy (MFE) of the system (Kucharík et al., 2013).
Average Base Energy (ABE) score
Considering the latter (Kumar et al., 2021) [1] defined the Average base energy score (ABEscore) as the average free energy values of all the local minima within the energy landscape of a given RNA sequence divided by the length of the sequence. According to the authors, this score is used for considering the average contribution of each base to the formation of local minima structures. The used formula is shown on equation (4)
Equation 4. ABE score calculation
Average Loop Energy (ALE) score
The final score used by (Kumar et al., 2021) [1] is the Average loop energy ALEscore, which is used for analyzing the stability of the RNA loops within all the local minima structures. Greater the ALEscore, the overall RNA score is reduced; a high total loop energy in a specific local minima increases the ALEscore, and thus contributing to reduce the overall RNA score. The equation (5) is used for this score calculation
Equation 5. ALE score calculation
RNA Folding and base pairing free energies model adapted with target accessibility mRNA factor
Model II
On the other hand, (Vazquez-Anderson et al., 2017) [5] developed a model for studying the hybridization efficiency between a short RNA and its target, regardless of the short RNA type; for its study they used an antisense RNA (asRNA) as a model for in vitro testing of the hybridization efficiency and thus receiving feedback to their mathematical model in order to make the necessary adjustments. The authors indicate that the term ΔG overall noted by the equation (6) is the free energy change related to all the steps involved on the hybridization between the asRNA (sRNA in our case) and the target RNA region, lower the ΔG overall, greater the chances of an adequate hybridization process
Equation 6. Free energy of a asRNA (antisense RNA) and a mRNA basepairing
As the hybridization free energy between our sRNA and the target becomes more negative, the reaction is more likely to occur successfully, while low free energy values for the self-folding of the sRNA (∆G asRNA unfolding) and the target mRNA region (∆G Target unfolding) individually elevates the ΔG overall making it more difficult to bind the sRNA and the mRNA because the individual secondary structures of each one are energetically favorable and thus difficult to unbind itseves and then pair between them.
Equation 7. Baseline correlation between hybridization effiency v and ΔG overall
The correlation between the ΔG overall and the hybridization efficiency is noted on equation (7), where v is the hybridization efficacy and is a constant with a given value depending upon the tested system and the units/methodology used for measuring v
However, the authors [5] developed an in vivo Thermodynamic Accesibility-adjusted (inTherAcc) model considering the folding free energies and an accessibility factor, denoted as . It is important to note that this model comes from the experimental optimization of the thermodynamic parameters, but does not take into account the term ∆GasRNA unfolding from equation (6). The inTherAcc model is noted on equation (8)
Equation 8. inTherAcc model for hybridization efficiency prediction between an asRNA and a mRNA target
Where the accessibility factor is calculated using the formula (9)
Equation 9. Calculation of the global accesibility factor for a given mRNA (target) sequence
The local availability factor θi is assigned to each nucleotide within the target mRNA region, and is equal to one minus the base-pairing probabilities of that nucleotide and the rest of nucleotides of the target region.
Final integration of models I and II
Normalization and equal ponderation
Because both models have a different value range, it is necessary to normalize both in a 0-1 scale and then sum both scores for each studied sRNA-mRNA pair in order to give an equal ponderation.
Our final model is synthetized on equation (10) for scoring the sRNAs and its corresponding mRNAs
Equation 10. Final scoring formula for calculating the likelihood of a given sRNA to hybridize efficiently and having structural and thermodinamical properties of true sRNAs
Neural network-based model for sRNA role prediction
Model III
Once the scoring of the sRNAs was established, we got into a question: Is there a way to predict sRNA function?
Research
After several futile attempts to find either a mathematical or computational model in literature able to accurately predict sRNA functioning (that is, whether its interaction with the target mRNA will result in an up- or down-regulation), we concluded a new approach for its prediction was due
Design
During our research we realized that there might be enough sRNA-mRNA interactions reported in literature to build a functioning machine learning model. Nonetheless, these were spread across multiple databases and publications. For our first iteration, we used [ ]’s dataset to construct a baseline model consisting of a neural network with only a few layers and neurons. The hybridization window was obtained through the RNAup library and then this was used to calculate features derived from our sRNA designer model: ALE score, ABE score, U score, sequence score, mRNA-sRNA base pairing minimum free energy, and self-folding mRNA target region minimum free energy. Unfortunately, this initial model was not able to capture the behavior of the data, suffering from both high variance and bias. Thus, in order to further improve our predictions, we had to gather and homogenize observations from other sources with the help of the NCBI database. This homogenization was needed because many of these interactions did not readily contain all the data needed to calculate the desired features. This being done, we then had to look for new features as well as apply careful preprocessing.
Model
During our research we realized that there might be enough sRNA-mRNA interactions reported in literature to build a functioning machine learning model. Nonetheless, these were spread across multiple databases and publications. For our first iteration, we used [ ]’s dataset to construct a baseline model consisting of a neural network with only a few layers and neurons. The hybridization window was obtained through the RNAup library and then this was used to calculate features derived from our sRNA designer model: ALE score, ABE score, U score, sequence score, mRNA-sRNA base pairing minimum free energy, and self-folding mRNA target region minimum free energy. Unfortunately, this initial model was not able to capture the behavior of the data, suffering from both high variance and bias. Thus, in order to further improve our predictions, we had to gather and homogenize observations from other sources with the help of the NCBI database. This homogenization was needed because many of these interactions did not readily contain all the data needed to calculate the desired features. This being done, we then had to look for new features as well as apply careful preprocessing.
For our model’s design, the sequences were first preprocessed since many of them were not candidates and had to be dropped. The positions of hybridization were found and used to calculate training parameters. The following features, as well as their squared value, were calculated with the help of NuPack and ViennaRNA:
Table 1. Parameters used for neural network model training
Specifically, our model was composed of a 40-neuron input layer followed by a BatchNormalization layer, our two main layers (each dense with 128 neurons and a ReLu activation function) which were then fed to a Dropout layer to avoid overfitting and finally an output layer with a sigmoid activation function. All of these with kernel, bias, and activity regularizers.
Figure 5. Very high-level structure of the program’s pipeline. (Labels and Features mixed up)
Specifically, our model was composed of a 40-neuron input layer followed by a BatchNormalization layer, our two main layers (each dense with 128 neurons and a ReLu activation function) which were then fed to a Dropout layer to avoid overfitting and finally an output layer with a sigmoid activation function. All of these with kernel, bias, and activity regularizers.
Figure 6. A peek inside a neural network, with a view of its layers, structure and weights
In order to optimize the neural network’s hyperparameters, libraries from Keras, such as Gridsearch and Randomsearch, were used. After 300 epochs, the model in some cases achieved an outstanding 0.92 and 0.91 F1 score for down and up-regulation respectively.
Figure 7. Plot of training epochs against both their train and test set accuracies
The trained model was then incorporated into the software’s pipeline to give the user a prediction for the designed sRNA’s function along with a confidence score for it. For both upregulation and downregulation, we tried to choose the sRNAs that best balanced a high confidence with a high combined thermodynamic score. Even if a simple average could seem like the obvious choice, the right one is to calculate the harmonic mean of both, since this metric penalizes more harshly having either of them being too low.
Harmonic mean formula
The model does have a harder time predicting upregulating sRNAs with high confidence which could be a reflection of the unfortunate small size and unbalance that our dataset suffers. There is also a hint of high variance in our mode, which we somewhat controlled using Keras’ bias regularizers. We also had our shot at applying principal component analysis to our features in an attempt to reduce its dimensionality mainly to reduce training time. Nonetheless, we did not manage to find a good compromise between dimensionality reduction and model performance so the idea was abandoned for now.
There were also intentions to incorporate an XGBoosted tree model to create an ensemble model that could hopefully further increase performance in both terms of metrics and generality. Unfortunately, due to time constraints, the model could not be implemented into the final version of the software.
References
- Kumar, K., Chakraborty, A., & Chakrabarti, S. (2021). PresRAT: A server for identification of bacterial small-RNA sequences and their targets with probable binding region. RNA Biology, 18(8), 1152–1159. https://doi.org/10.1080/15476286.2020.1836455
- Trotta, E. (2014). On the Normalization of the Minimum Free Energy of RNAs by Sequence Length. PLoS ONE, 9(11), e113380. https://doi.org/10.1371/journal.pone.0113380
- Woodson, S. A. (2010). Compact Intermediates in RNA Folding. Annual Review of Biophysics, 39(1), 61–77. https://doi.org/10.1146/annurev.biophys.093008.131334
- Kucharík, M., Hofacker, I. L., Stadler, P. F., & Qin, J. (2014). Basin Hopping Graph: A computational framework to characterize RNA folding landscapes. Bioinformatics, 30(14), 2009–2017. https://doi.org/10.1093/bioinformatics/btu156
- Vazquez-Anderson J, Mihailovic MK, Baldridge KC, Reyes KG, Haning K, Cho SH, Amador P, Powell WB, Contreras LM. Optimization of a novel biophysical model using large scale in vivo antisense hybridization data displays improved prediction capabilities of structurally accessible RNA regions. Nucleic Acids Res. 2017 May 19;45(9):5523-5538. doi: 10.1093/nar/gkx115. PMID: 28334800; PMCID: PMC5435917