CRISPRLY

Machine Learning Model

CRISPR systems have been successfully adapted to edit or detect the genomes of various organisms. However, our ability to predict the editing outcome at specific sites is limited. The precision of DNA editing (i.e., the recurrence of a specific indel, an insertion or deletion) varies considerably among sites, with some targets showing one highly preferred indel and others displaying numerous infrequent indels [1].


It is time-consuming and laborious to test all guide RNAs before starting a gene-editing experiment, so in silico guide RNA design is crucial for successful genome editing. Several online tools and applications have been developed for designing guide RNAs, and several excellent reviews and articles comprehensively summarize and benchmark them. Despite considerable efforts to date, predicting the activity and specificity of guide RNAs remains challenging. In addition, most existing tools and methods were developed for Cas9; the number available for Cas12a is relatively limited. There is therefore an urgent need for new computational tools for Cas12a. In this work, we propose a deep learning approach to evaluate the performance of guide RNAs by predicting the insertion-deletion (indel) frequency. Our approach of using two convolutional neural network classifiers stems from strategies used in image classification: a first classifier predicts on-target activity using the matched DNA sequences, and a second classifier predicts off-target effects using the mismatched DNA sequences. Each classifier operates on "one-hot" feature representations. To capture the important characteristics of functional guide RNAs, we perform a permutation importance analysis on the neurons extracted by the convolution and pooling processes and map the top neurons back to the original input matrix.

Convolution Neural Network

A convolutional neural network (CNN) is a type of feed-forward artificial neural network. The key aspect of CNNs is that they can learn hierarchical spatial representations, rather than relying on laborious manual feature engineering. The architectural components of a CNN include three types of layers: convolution layers, pooling layers, and fully connected layers. In the convolution layers, weight vectors called filters are multiplied across subregions of the data, enabling CNNs to discover locally correlated patterns regardless of where they occur. The pooling layers perform maximum or average subsampling of non-overlapping subregions, providing invariance to local translations. The fully connected layers aggregate local features into more abstract features by computing weighted sums and applying nonlinear functions.
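The two core operations described above can be sketched in plain Python on a 1-D toy signal (no framework; the toy signal, filter, and pool size are illustrative, not taken from our model):

```python
# Minimal sketch of a convolution layer and a max-pooling layer on a 1-D signal.

def conv1d(signal, kernel):
    """Slide a weight vector (filter) across the signal ('valid' padding, stride 1)."""
    n = len(signal) - len(kernel) + 1
    return [sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
            for i in range(n)]

def max_pool(signal, size):
    """Maximum subsampling over non-overlapping subregions."""
    return [max(signal[i:i + size]) for i in range(0, len(signal) - size + 1, size)]

signal = [0, 1, 3, 1, 0, 0, 2, 5, 2, 0]   # toy input with two local peaks
kernel = [1, 2, 1]                         # a simple "peak detector" filter

feature_map = conv1d(signal, kernel)       # -> [5, 8, 5, 1, 2, 9, 14, 9]
pooled = max_pool(feature_map, 2)          # -> [8, 5, 9, 14]
```

Note that the filter responds strongly wherever a peak occurs (positions 1 and 6), regardless of location, which is exactly the local-pattern detection described above.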

Source: https://www.mdpi.com/1424-8220/21/8/2866

Designed to analyze spatial information, CNNs have made major advances in tasks such as image recognition and natural language processing. The amount of data required for proper CNN model development varies considerably with the objectives of the task, the data complexity, and other factors; nevertheless, a rough rule of thumb is that 5,000 labelled examples per category is generally sufficient for acceptable performance. In bioinformatics, CNNs are also showing great promise for genomic sequence analysis. Traditional approaches in genomic sequence analysis often incorporate hard-coded position weight matrices (PWMs) to identify regulatory motifs [2]. In contrast, the initial convolution layer of a CNN acts as a set of motif detectors whose PWMs are not hard-coded but learned entirely from data. Prior studies have demonstrated that CNNs can outperform state-of-the-art methods in diverse applications, including prediction of transcription factor binding affinity and DNA sequence accessibility.

Deep Cas12a

This model was inspired by the DeepCpf1 architecture, which was developed to predict Cpf1 indel frequencies from sequence, i.e., the 34-base-pair target containing the guide RNA. DeepCpf1 is a deep-learning framework for AsCpf1 indel frequency prediction: it receives a 34-bp target sequence as input and produces a regression score that correlates highly with AsCpf1 activity. By using a CNN, DeepCpf1 eliminates the need for laborious manual feature engineering and can automatically learn informative representations of target sequences relevant to AsCpf1 activity profiles. We modified the DeepCpf1 architecture to make it suitable for predicting Cas12a indel frequencies.

Libraries used

TensorFlow for building the neural network layers, pandas for accessing and modifying the dataset, Matplotlib for generating plots, and NumPy for mathematical operations. The library functions used are compatible with Python 3.

Dataset

We first referred to large-scale data sets of AsCpf1 activity at target sequences, measured using a high-throughput next-generation sequencing method with 20-nt guide sequences in HEK293T cells. These high-throughput experiments generated a data set consisting of target sequence compositions and their corresponding indel frequencies [1]. Our dataset is 20,826 rows long; each record contains a 34 (30 + 4) base-pair sequence and its corresponding indel frequency. The indel frequencies are rounded to 8 decimal places.

Preprocessing

We have a 34 nucleotide long string sequence which comprises only 4 characters each corresponding to the 4 nucleotides of the DNA.
The one-hot encoding input layer converts the sequence into a numerical representation for downstream processing. It encodes the nucleotide at each position as a four-dimensional binary vector, in which each element indicates one nucleotide type: A, T, G, and C. The encoding layer then concatenates the binary vectors into a 4-by-34 binary matrix representing the whole 34-bp target sequence.

For example, 
‘A’ is coded as [[1], [0], [0], [0]]
‘T’ is coded as [[0], [1], [0], [0]]
‘G’ is coded as [[0], [0], [1], [0]]
‘C’ is coded as [[0], [0], [0], [1]]

Thus we have converted our problem into an image-processing problem: each sequence is represented as a single-channel (grayscale) image of shape (4, 34), with one row per nucleotide type and one column per sequence position.
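The encoding above can be sketched in a few lines of plain Python (one row per nucleotide type, one column per position; the function name is ours):

```python
# Encode a DNA string as a 4 x len(seq) binary matrix.
# Rows correspond to A, T, G, C; columns correspond to sequence positions.

def one_hot_encode(seq):
    """Return the one-hot matrix of a DNA sequence as nested lists."""
    return [[1 if base == nt else 0 for base in seq.upper()] for nt in "ATGC"]

matrix = one_hot_encode("ATGC")
# matrix == [[1, 0, 0, 0],   # A row
#            [0, 1, 0, 0],   # T row
#            [0, 0, 1, 0],   # G row
#            [0, 0, 0, 1]]   # C row
```

For a full 34-bp target the same function yields a 4-by-34 matrix, which is the grayscale "image" fed to the CNN.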

Training and the testing data

First, we split the data set into a training set (n = 18,700) and a testing set (n = 2,126) by random sampling. The training set was used for model selection and training, while the testing set was used only to evaluate the performance of the model and was never used during training.
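A random split like the one above can be sketched with only the standard library (indices stand in for dataset rows here; in our pipeline the rows are loaded with pandas, and the seed is arbitrary):

```python
import random

def train_test_split(n_total, n_test, seed=42):
    """Randomly sample n_test row indices for testing; the rest form the training set."""
    rng = random.Random(seed)
    indices = list(range(n_total))
    rng.shuffle(indices)
    return indices[n_test:], indices[:n_test]   # (train indices, test indices)

train_idx, test_idx = train_test_split(20826, 2126)
# len(train_idx) == 18700, len(test_idx) == 2126, with no overlap
```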

Building the model

We have used a CNN architecture to process the input data. The architecture comprises:

  • 3 convolutional layers
    • 1st layer: kernel size 1x3 with 12 filters
    • 2nd layer: kernel size 1x4 with 10 filters
    • 3rd layer: kernel size 1x5 with 8 filters
  • A flattening layer to condense the CNN output into a 1-D vector and feed it to the fully connected network
  • A fully connected layer of 58 neurons with a dropout rate of 0.3 and a linear activation function
  • A final (output) layer of 1 neuron with softmax as the activation function
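To see how the one-hot input flows through these layers, the intermediate shapes can be worked out by hand. This is a sketch assuming 'valid' padding, stride 1, and no pooling between the convolutions (none is listed above); the input is the (4, 34) one-hot matrix treated as a single-channel image:

```python
# Track tensor shapes through the three convolutional layers described above.

def conv_output(h, w, kh, kw, filters):
    """Output shape of a conv layer with 'valid' padding and stride 1."""
    return (h - kh + 1, w - kw + 1, filters)

shape = (4, 34, 1)                                   # (height, width, channels)
shape = conv_output(shape[0], shape[1], 1, 3, 12)    # -> (4, 32, 12)
shape = conv_output(shape[0], shape[1], 1, 4, 10)    # -> (4, 29, 10)
shape = conv_output(shape[0], shape[1], 1, 5, 8)     # -> (4, 25, 8)

flattened = shape[0] * shape[1] * shape[2]           # -> 800 units into the
                                                     # 58-neuron dense layer
```

The 1xk kernels slide only along the 34 sequence positions, so each filter scans the sequence without mixing the four nucleotide rows.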

Training of the model

We fitted the model on the training data split earlier from the main dataset, running it for 300 epochs with a validation split of 0.2 and a batch size of 32.
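The configuration above implies the following per-epoch numbers (a sketch; in Keras, validation_split carves the validation set off the training data before batching):

```python
import math

n_train_total = 18700        # training set size from the split above
validation_split = 0.2
batch_size = 32

n_val = int(n_train_total * validation_split)        # 3740 validation samples
n_fit = n_train_total - n_val                        # 14960 samples fitted per epoch
batches_per_epoch = math.ceil(n_fit / batch_size)    # 468 weight updates per epoch
```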

Results

We used the mean absolute error (MAE) as the evaluation metric for the prediction scores. MAE is the average absolute difference between the predicted values and the actual values. It is a scale-dependent accuracy measure, since the error is computed on the same scale as the observations, and it is a standard evaluation metric for regression models in machine learning. After training as specified, we obtained a validation loss of 13.5850 and a mean absolute error (MAE) of 12.9708.
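The metric itself is straightforward to compute; the values below are made-up indel frequencies purely for illustration:

```python
# Mean absolute error: average of the absolute differences between paired values.

def mean_absolute_error(actual, predicted):
    """MAE between two equal-length sequences of numbers."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual    = [12.0, 45.5, 3.2, 78.1]   # hypothetical measured indel frequencies (%)
predicted = [10.0, 50.5, 5.2, 70.1]   # hypothetical model outputs (%)

mae = mean_absolute_error(actual, predicted)   # -> 4.25
```

An MAE of 12.97 on indel frequencies expressed as percentages therefore means the predictions are off by roughly 13 percentage points on average.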

Future Direction

The results obtained with this model are not yet good enough to rely on: the mean absolute error is 12.97, whereas we had expected it to be around 1-2%. The Dry Lab team is working consistently to close this gap and hopes to succeed in the near future.

References

  1. Kim, H., Min, S., Song, M. et al. Deep learning improves prediction of CRISPR–Cpf1 guide RNA activity. Nat Biotechnol 36, 239–241 (2018). https://doi.org/10.1038/nbt.4061
  2. Luo, J., Chen, W., Xue, L. et al. Prediction of activity and specificity of CRISPR-Cpf1 using convolutional deep learning neural networks. BMC Bioinformatics 20, 332 (2019). https://doi.org/10.1186/s12859-019-2939-6