Model

Part 1 Off-target Prediction

1. Abstract

CRISPR-Cas9 is a popular, easy-to-use gene editing technique that allows manipulation of specific DNA fragments, but it carries a risk of off-target effects. Deep learning has recently brought significant progress in off-target prediction. However, existing methods are still not precise enough for gene editing applications, and almost none of them take physicochemical descriptors of the sgRNA-DNA pair into account. We designed a model based on DenseNet[1] to predict the off-target activity of an sgRNA at specific DNA fragments in CRISPR-Cas9 gene editing. The model performed well, reaching an auROC of 0.889 and an auPRC of 0.679. We also compared it with CFD[2], MIT[3], CNN_Crispr[4] and DeepCrispr[5], and the comparison shows that the proposed algorithm is competitive.

2. Introduction

The CRISPR/Cas9 (Clustered Regularly Interspaced Short Palindromic Repeats with Cas9) system is a widely used technology for precise gene editing[7]-[10]. The system, which consists of an sgRNA (single-guide RNA) and the Cas9 protein, has been applied in gene knockout, detection, labeling, and transcriptional regulation. In nature, it plays an important role in the immune defense of certain bacteria. Researchers have been able to use this technique to guide small pieces of RNA to specific locations in complex genomes[11]. DNA sequences can therefore be easily edited or modulated in a variety of species and cell types, including human cell lines, bacteria, zebrafish and monkeys[12],[13]. The CRISPR-Cas9 technique has been used in personalized therapy to edit and modulate harmful genes[14],[15]. For instance, Ma et al. (2017)[16] applied it to correct a pathogenic mutation in human embryos.

Although sgRNAs are designed to target specific DNA fragments, they can sometimes act on other regions and cause off-target effects[17]. The specificity of the CRISPR/Cas9 system depends mainly on the recognition sequence of the sgRNA. A designed sgRNA may bind non-target DNA sequences despite mismatches, resulting in unexpected gene mutations; this is called the off-target effect[3],[18]. Off-target effects can lead to genomic instability and disturb normal gene function, and they remain a major problem when applying CRISPR-Cas9 gene editing in the clinic[19]. Recent work has shown that deep learning is applicable to genomics research[20], and researchers have applied it to off-target prediction for CRISPR/Cas9. For example, Lin et al. developed two deep neural network models, a feed-forward neural network (FNN) and a CNN, for off-target prediction of CRISPR/Cas9[21].

DenseNet is a high-performing neural network model based on CNN[1]. We encoded each sgRNA-DNA sequence by one-hot encoding to form a 4×23 matrix, which converts the gene information into an image-like input. This allows DenseNet, originally designed for computer vision, to be adapted to gene sequences. In addition, we used the Python package PyBioMed to extract physicochemical descriptors and thereby add more features to the sequences.

In summary, the main innovations of this project are:

1. We made the first attempt to apply DenseNet to off-target prediction in CRISPR-Cas9 gene editing.

2. We extracted the physicochemical descriptor information from each sgRNA-DNA sequence pair and formed a matrix from it.

3. Result

3.1 Model structure and prediction

Our model uses DenseNet to perform classification tasks after extracting features based on sequence information and physicochemical descriptors.

We used the data published in the DeepCrispr article as our training data. The off-target data set contains a total of 29 sgRNAs from two different cell types: 293-related cell lines and K562t. The labels of off-target sites were set to "1", and the labels of on-target sites were set to "0". The data were divided into training, validation, and test sets at a ratio of 7:2:1.
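
The 7:2:1 division can be sketched as follows. This is a minimal illustration that assumes a simple random shuffle; the source does not state whether the split was random, stratified, or grouped by sgRNA, and the function name and seed are hypothetical.

```python
import random

def split_7_2_1(samples, seed=0):
    """Shuffle samples and split them into train/validation/test at 7:2:1."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # assumed: plain random shuffle
    n_train = (len(items) * 7) // 10
    n_val = (len(items) * 2) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test

train, val, test = split_7_2_1(range(100))
print(len(train), len(val), len(test))
```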

The sgRNA sequence and its corresponding DNA sequence both consist of 23 bases. We replaced U with T in the sgRNA, so each base in the sgRNA and the target DNA can be encoded as one of the four one-hot vectors [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0] and [0, 0, 0, 1]. As a result, every sequence can be represented by a 4 × 23 matrix.
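
A minimal sketch of this encoding is given below. The channel order A, C, G, T and the example sequence are assumptions made for illustration; the source does not specify which base maps to which vector.

```python
BASES = "ACGT"  # assumed channel order; the source does not state it

def one_hot_encode(seq):
    """Return a 4 x len(seq) matrix (list of 4 rows) for a DNA/RNA sequence."""
    seq = seq.upper().replace("U", "T")  # treat U as T, as described in the text
    matrix = [[0] * len(seq) for _ in BASES]
    for col, base in enumerate(seq):
        matrix[BASES.index(base)][col] = 1  # raises ValueError on non-ACGT symbols
    return matrix

sgrna = "GAGUCCGAGCAGAAGAAGAAUGG"  # hypothetical 23-nt example sequence
encoded = one_hot_encode(sgrna)
print(len(encoded), len(encoded[0]))
```

Each column of the result contains exactly one 1, so a 23-base sequence yields a sparse 4 × 23 matrix.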

Structural and physicochemical descriptors extracted from sequence data have been widely used to represent sequences and to predict structural, functional, expression and interaction profiles of proteins and peptides as well as DNAs and RNAs. In this project, we turned to PyBioMed[6], a Python-based multifunctional toolkit for generating numerical feature representations of DNA, protein, and peptide sequences. To improve the accuracy of sgRNA off-target prediction, we extracted physicochemical descriptors from the DNA sequence with PyBioMed and used them, together with the one-hot encoding of the DNA sequence, as input to the DenseNet network.
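
To give a feel for what a sequence descriptor is, the sketch below computes dinucleotide frequencies by hand. It is only a stand-in: it does not use PyBioMed or reproduce its API, and the function name and example sequence are hypothetical.

```python
from itertools import product

def dinucleotide_frequencies(seq):
    """Frequencies of the 16 dinucleotides, in fixed AA, AC, ..., TT order."""
    seq = seq.upper().replace("U", "T")
    kmers = ["".join(p) for p in product("ACGT", repeat=2)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - 1):
        counts[seq[i:i + 2]] += 1  # overlapping windows of length 2
    total = max(len(seq) - 1, 1)
    return [counts[k] / total for k in kmers]

features = dinucleotide_frequencies("GAGTCCGAGCAGAAGAAGAATGG")
print(len(features))
```

A fixed-length vector like this can be placed alongside the one-hot matrix as an additional input, which is the role the descriptors play in the model described above.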

DenseNet reuses features directly by establishing dense connections between each layer and all the layers before it. Each layer takes the feature maps of all preceding layers as input, and its own feature maps are passed as input to all subsequent layers[1]. DenseNet consists of DenseBlocks and transition layers. Each unit in a DenseBlock is a bottleneck layer comprising a 1x1 conv and a 3x3 conv. Between consecutive blocks there is a transition layer comprising a BN, a 1x1 conv and a pooling operation. This structure not only alleviates vanishing gradients, reduces the number of parameters and computations, and resists overfitting, but also reduces redundancy[1].

img

Fig. 1. Design of model. Our model consists of 3 Dense Blocks. Each Dense Block is composed of 6 Bottleneck Layers. For every Bottleneck Layer, the order of components is BN + ReLU + 1×1 Conv + BN + ReLU + 3×3 Conv.
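
The effect of dense connections on channel counts can be traced with a short sketch. The 3 blocks and 6 layers per block come from the caption above; the growth rate of 12, the compression factor of 0.5 and the 24 input channels are assumptions chosen for illustration, since the source does not report them.

```python
def dense_block_channels(in_channels, num_layers, growth_rate):
    """Channel count seen by each bottleneck layer, and the block output.

    Each layer receives the concatenation of all earlier feature maps and
    contributes growth_rate new channels.
    """
    inputs = [in_channels + i * growth_rate for i in range(num_layers)]
    return inputs, in_channels + num_layers * growth_rate

def densenet_channels(in_channels, blocks=3, layers_per_block=6,
                      growth_rate=12, compression=0.5):
    """Trace channels through dense blocks with transition layers in between."""
    trace = []
    channels = in_channels
    for b in range(blocks):
        _, channels = dense_block_channels(channels, layers_per_block, growth_rate)
        trace.append(("block", b + 1, channels))
        if b < blocks - 1:  # a transition layer sits between consecutive blocks
            channels = int(channels * compression)  # assumed channel compression
            trace.append(("transition", b + 1, channels))
    return trace

for entry in densenet_channels(in_channels=24):
    print(entry)
```

Because every layer only adds a small, fixed number of channels while reading all earlier ones, the network stays compact even as its effective input grows.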

img

Fig. 2. ROC curve and PRC curve of DUT_model

3.2 Model comparison

We selected four sgRNA off-target propensity prediction models for comparison: CFD[2], MIT[3], CNN_Crispr[4] and DeepCrispr[5]. CFD is a scoring model for evaluating the off-target propensity of an sgRNA-DNA interaction; it assigns a score to each mismatch between the sgRNA and the corresponding DNA sequence according to its position and type. When several mismatches appear in a sequence pair, the corresponding scores are multiplied to obtain the final score. DeepCrispr used the largest data set available for model training and introduced an auto-encoder to automatically acquire potential features of the sgRNA-DNA sequence[5]. CNN_Crispr tried new feature representation methods to embed sequence information into the deep learning model, combining an RNN with a CNN[4]. We downloaded the prediction models from their websites, evaluated them on the same test set as our model, and compared the auROC and auPRC values.
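
The multiplicative scoring idea behind CFD can be sketched as below. The score table here is entirely hypothetical and is not the published CFD matrix; the function name and the example sequences are also made up for illustration.

```python
from math import prod

def cfd_style_score(sgrna, dna, mismatch_scores):
    """Multiply the scores of all mismatched positions; no mismatch gives 1.0."""
    sgrna = sgrna.upper().replace("U", "T")
    dna = dna.upper()
    factors = [
        mismatch_scores[(pos, g, d)]  # keyed by 1-based position and base pair
        for pos, (g, d) in enumerate(zip(sgrna, dna), start=1)
        if g != d
    ]
    return prod(factors, start=1.0)

# Hypothetical per-mismatch scores, NOT the real CFD values.
hypothetical_scores = {(3, "G", "A"): 0.5, (10, "C", "T"): 0.8}

score = cfd_style_score("AAGAAAAAACAA", "AAAAAAAAATAA", hypothetical_scores)
print(score)
```

Because the factors are multiplied, each additional mismatch with a score below 1 lowers the predicted off-target propensity further.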

img

Fig. 3. Visualization of the auROC and auPRC values of the five models

Our model achieved an auROC value of 0.889 and an auPRC value of 0.694 on the full test set. The auROC value is slightly lower than those of CNN_Crispr and DeepCrispr, while the auPRC value of 0.694 is higher than those of CFD, MIT and DeepCrispr. The data set was still unbalanced even after undersampling. For such data, the PRC curve and the area under it are the more important measures for model evaluation, and on these measures our model had a stronger competitive advantage than the others.
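
For readers unfamiliar with the two metrics, the sketch below computes them in plain Python on a toy example. The labels and scores are invented, and average precision is used as one common estimator of the area under the PRC curve; the source does not say which estimator was used.

```python
def au_roc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Toy data: 2 off-target sites (label 1) among 6 candidates.
labels = [1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
print(au_roc(labels, scores), average_precision(labels, scores))
```

On unbalanced data, a single highly ranked negative barely moves the auROC but lowers the precision noticeably, which is why the PRC-based measure is the stricter one.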

3.3 Performance in wet lab

The sequences used in this model were derived from human databases because of limitations in obtaining an E. coli (Escherichia coli) target database. To explore the applicability of the model to E. coli, we compared some key enzymes of E. coli and human (Homo sapiens). We obtained the amino acid sequences of these enzymes from the E. coli K-12 substr. MG1655 reference genome and the nucleic acid sequences of the corresponding genes from the BioCyc database, and analyzed them with the BLAST tool of NCBI (National Center for Biotechnology Information). We used the same method to compare these enzymes of yeast (Saccharomyces cerevisiae S288c) and mouse (Mus musculus) with those of human as auxiliary reference information. The results showed some similarity between human and E. coli in the DNA sequences of enzymes in major metabolic pathways.

img

Fig. 4. Results of the amino acid sequence BLAST comparison. Except for a few enzymes whose nucleic acid sequence similarity is too low, most have a similarity of more than 60%, and the highest, 6-phosphofructokinase, reaches about 83%. This analysis suggests that our model has good application value for E. coli.

Based on this analysis, we expected that our model can be applied to E. coli. Next, we ran predictions for the sgRNAs used in the wet-lab experiments. The labels of off-target sites were set to "1", and the labels of other sites were set to "0". All known labels come from our wet-lab results. Deep learning models, with their thousands of parameters, can learn features automatically from large data sets. In theory, they can approximate any function and can therefore solve very complex problems. As more CRISPR data accumulate, we expect our model to perform better.

3.4 Result of Prediction

img

Almost all predictions are correct, indicating that our model can help reduce the number of wet-lab experiments and avoid some causes of experimental failure.

4. Discussion

In this work, we applied DenseNet, a deep network widely used in computer vision, to nucleic acid data and achieved excellent results. In the past, sequential data such as nucleic acid sequences were usually processed with networks designed for natural language, such as RNNs[22] or transformers[23]. Nevertheless, our input is a sparse matrix, which DenseNet is also well suited to process.

With the rise of computational biology, descriptors have been widely used in protein structure metrology[24]. Descriptors can capture the structure and the physical and chemical properties of biological macromolecules well with a small amount of data. In this study, we added a PyBioMed-based DNA descriptor as one of the inputs[6]. Fortunately, the continuous evolution of sequencing-based CRISPR off-target detection technologies has produced many CRISPR databases[24, 25]. These large amounts of data can support deep learning. However, most of the existing data were measured in in vitro reactions, which differ non-negligibly from in vivo reactions. In the future, intracellular off-target data from multiple species need to be explored together.

Over the past few years of rapid development in neural networks and deep learning, suitable network models and concepts have been developed for many kinds of problems. Now that deep learning is becoming widely used in biology, is there a network particularly suited to biological problems? CRISPR gene editing is a model system for protein-RNA-DNA interactions. Can its interpretation and modeling be transferred to the study of complex biological macromolecular interaction networks? These questions deserve our continued study.

Calculation Formulas Applied in Our Model

img

Our network comprises L layers, each of which implements a non-linear transformation H_l(·), where l indexes the layer. We denote the output of the l-th layer as x_l, and [x_0, x_1, . . . , x_{l-1}] refers to the concatenation of the feature maps produced in layers 0, . . . , l-1.
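
Written out, the dense connectivity described here takes the following form. This is the standard expression from the DenseNet paper[1], given as a sketch of what the formula image presumably shows.

```latex
x_l = H_l\left([x_0, x_1, \ldots, x_{l-1}]\right)
```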

img

Convolution calculation formula, where k is the parameter and b is the offset.
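
A common way to write a two-dimensional convolution with kernel k and offset (bias) b is sketched below. The input x, the output y and the index names i, j, m, n are assumptions introduced for illustration; the formula image may use a different notation.

```latex
y_{i,j} = \sum_{m}\sum_{n} k_{m,n}\, x_{i+m,\, j+n} + b
```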

img

Formula of full connection layer, where W is the parameter matrix and B is the offset vector.
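
For reference, a fully connected layer with parameter matrix W and offset (bias) vector B is usually written as below. The input x, the output y and the activation function \sigma are assumed symbols that do not come from the source.

```latex
y = \sigma\left(W x + B\right)
```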

img

Back propagation formula, where L is loss function.
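
As an illustration of back propagation, the gradients for a single linear layer y = Wx + B follow from the chain rule as sketched below. The symbols x, y, W and B are assumed here for illustration, with L the loss function; the formula image may present a different but equivalent form.

```latex
\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y}\, x^{\top}, \qquad
\frac{\partial L}{\partial B} = \frac{\partial L}{\partial y}, \qquad
\frac{\partial L}{\partial x} = W^{\top} \frac{\partial L}{\partial y}
```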

5. References

[1] Huang G, Liu Z, Maaten L and Weinberger KQ. Densely Connected Convolutional Networks. 2017 IEEE CVPR. 2017, pp: 2261-2269.

[2] Doench JG, Fusi N, Sullender M, Hegde M, Vaimberg EW, Donovan KF, Smith I, Tothova Z, Wilen C, Orchard R, Virgin HW, Listgarten J, Root DE. Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nat Biotechnol. 2016, 34 (2): 184-191.

[3] Hsu PD, Scott DA, Weinstein JA, Ran FA, Konermann S, Agarwala V, Li YQ, Fine EJ, Wu XB, Shalem O, Cradick TJ, Marraffini LA, Bao G, Zhang F. DNA targeting specificity of RNA-guided Cas9 nucleases. Nat Biotechnol. 2013, 31 (9): 827-832.

[4] Liu QY, Cheng X, Liu G, Li BH, Liu XQ. Deep learning improves the ability of sgRNA off-target propensity prediction. BMC Bioinformatics. 2020, 21 (1): 51.

[5] Chen Z, Zhao P, Li FY, Leier A, Marquez-Lago TT, Wang YN, Webb GI, Smith AI, Daly RJ, Chou KC, Song JN. iFeature: a Python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics. 2018, 34 (14): 2499-2502.

[6] Cong L, Ran FA, Cox D, Lin SL, Barretto R, Habib N, Hsu PD, Wu XB, Jiang WY, Marraffini LA, Zhang F. Multiplex Genome Engineering Using CRISPR/Cas Systems. Science. 2013, 339 (6121): 819-823.

[7] Esvelt KM, Mali P, Braff JL, Moosburner M, Yaung SJ, Church GM. Orthogonal Cas9 proteins for RNA-guided gene regulation and editing. Nat Methods. 2013, 10 (11): 1116-1121.

[8] Mali P, Yang LH, Esvelt KM, Aach J, Guell M, DiCarlo JE, Norville JE, Church GM. RNA-Guided Human Genome Engineering via Cas9. Science. 2013, 339 (6121): 823-826.

[9] Ran FA, Hsu PD, Lin CY, Gootenberg JS, Konermann S, Trevino AE, Scott DA, Inoue A, Matoba S, Zhang Y, Zhang F. Double Nicking by RNA-Guided CRISPR Cas9 for Enhanced Genome Editing Specificity. Cell. 2013, 154 (6): 1380-1389.

[10] Sander JD, Joung JK. CRISPR-Cas systems for editing, regulating and targeting genomes. Nat Biotechnol. 2014, 32 (4): 347-355.

[11] Hsu PD, Lander ES, Zhang F. Development and Applications of CRISPR-Cas9 for Genome Engineering. Cell. 2014, 157 (6): 1262-1278.

[12] Sander JD, Joung JK. CRISPR-Cas systems for editing, regulating and targeting genomes. Nat Biotechnol. 2014, 32 (4): 347-355.

[13] Kang XJ, Caparas CIN, Soh BS, Fan Y. Addressing challenges in the clinical applications associated with CRISPR/Cas9 technology and ethical questions to prevent its misuse. Protein & Cell. 2017, 8 (11): 791-795.

[14] Liang PP, Xu YW, Zhang XY, Ding CH, Huang R, Zhang Z, Lv J, Xie XW, Chen YX, Li YJ, Sun Y, Bai YF, Songyang Z, Ma WB, Zhou CQ, Huang JJ. CRISPR/Cas9-mediated gene editing in human tripronuclear zygotes. Protein & Cell. 2015, 6 (5): 363-372.

[15] Niederberger C. Correction of a Pathogenic Gene Mutation in Human Embryos. J Urol. 2018, 199 (2): 330-330.

[16] Chen FQ, Ding X, Feng YM, Seebeck T, Jiang YF, Davis GD. Targeted activation of diverse CRISPR-Cas systems for mammalian genome editing via proximal CRISPR targeting. Nat Commun. 2017, 8: 14958.

[17] Zhang XH, Tee LY, Wang XG, Huang QS, Yang SH. Off-target Effects in CRISPR/Cas9-mediated Genome Engineering. Mol Ther-Nucl Acids. 2015, 4: e264.

[18] Cho SW, Kim S, Kim Y, Kweon J, Kim HS, Bae S, Kim JS. Analysis of off-target effects of CRISPR/Cas-derived RNA-guided endonucleases and nickases. Genome Res. 2014, 24 (1): 132-141.

[19] Zeng HY, Edwards MD, Liu G, Gifford DK. Convolutional neural network architectures for predicting DNA-protein binding. Bioinformatics. 2016, 32 (12): 121-127.

[20] Lin J, Wong KC. Off-target predictions in CRISPR-Cas9 gene editing using deep learning. Bioinformatics. 2018, 34 (17): 656-663.

[21] Niu R, Peng JJ, Zhang ZP, Shang XQ. R-CRISPR: A Deep Learning Network to Predict Off-Target Activities with Mismatch, Insertion and Deletion in CRISPR-Cas9 System. Genes. 2021, 12 (12).

[22] Liu Q, He D, Xie L. Prediction of off-target specificity and cell-specific fitness of CRISPR-Cas System using attention boosted deep learning and network-based gene feature. PLoS Comput Biol. 2019, 15 (10).

[23] Qiu TY, Qiu JX, Feng J, Wu DF, Yang YY, Tang KL, Cao ZW, Zhu RX. The recent progress in proteochemometric modelling: focusing on target descriptors, cross-term descriptors and application scope. Brief Bioinform. 2017, 18 (1): 125-136.

[24] Tsai SQ, Zheng Z, Nguyen NT, Liebers M, Topkar VV, Thapar V, Wyvekens N, Khayter C, Iafrate AJ, Le LP, Aryee MJ, Joung JK. GUIDE-seq enables genome-wide profiling of off-target cleavage by CRISPR-Cas nucleases. Nat Biotechnol. 2015, 33 (2): 187-197.

[25] Tsai SQ, Nguyen NT, Malagon-Lopez J, Topkar VV, Aryee MJ, Joung JK. CIRCLE-seq: a highly sensitive in vitro screen for genome-wide CRISPR Cas9 nuclease off-targets. Nat Methods. 2017, 14 (6): 607-614.

Part 2 Molecular Dynamics

1. Abstract

Expression of SacB converts sucrose to levan, which accumulates in the periplasm and is toxic to E. coli. We used the sacB gene as a counter-selection marker in our "strainer". However, the CFU/μg obtained with the wild-type sacB gene was low even with 0.01% sucrose in the medium. Structural insights into wild-type SacB reveal that S164 is important for stabilizing D86, the nucleophile. We speculated that the S164T mutation could decrease the catalytic efficiency. We therefore modelled the new hydrogen bonds formed by S164T and the position of the D86 carboxyl group by molecular dynamics, and tested our conjecture in wet-lab experiments.

2. Introduction

The "strainer" uses double-stranded DNA breaks (DSBs) as a signal to start transcription of a gRNA that targets the plasmid harboring the sacB gene. Expression of sacB converts sucrose to levan, which accumulates in the periplasm and is toxic to E. coli. When the sacB plasmid is cured by the CRISPR/Cas system, the successfully recombined strain can survive in medium containing sucrose. A strain without DSBs still retains the plasmid harboring sacB and cannot survive in medium containing sucrose. In our wet-lab experiments, we found that the toxicity of sacB in our "strainer" was too high. When we compared the original CRISPR/Cas method with the "strainer" for gene editing, the "strainer" increased the editing efficiency, but its CFU/μg was much lower than that of the original CRISPR/Cas method even with 0.01% sucrose. We therefore turned to dry-lab work to design a sacB mutant with lower toxicity for E. coli, one that could increase the CFU/μg of the "strainer" while keeping its high editing efficiency.

Structural insights into wild-type SacB reveal that S164 forms hydrogen bonds with the nucleophile D86 and with the 4-OH of the fructose group, and that S164 is important for stabilizing D86 (Figure 1). We speculated that the S164T mutation, which adds a methyl group, would change the orientation of the -OH group and lead to new hydrogen bonds. The conformation of the D86 carboxyl group would then be restricted by hydrogen bonding, resulting in a reduced hydrolysis rate and lower cell toxicity. We modelled the new hydrogen bond formation and the position of the D86 carboxyl group by molecular dynamics, and tested our conjecture in wet-lab experiments.

img

Fig. 1. The first layer comprises the amino acids closest to the substrate (sucrose); all amino acids shown lie within 3.5 Å of the substrate. W85, D86, W163, R246, D247 and E342 are fully conserved in the GH68 family.

3. Results and Discussion

Molecular dynamics (MD) simulations.

MD simulations were performed using the Amber software [1] and the ff99SB force field [2]. The selected docking complexes of SacB and sucrose were solvated in OPC water in a truncated octahedral box for calculating protein–ligand interactions. The box size was chosen to avoid interactions through the periodic boundaries. Nonbonded interactions were truncated at a cutoff distance of 11 Å. The system was first minimized twice with the steepest descent method, 5000 steps each, while restraining the atoms of the protein and the ligand with 10 kcal/mol and 0 kcal/mol, respectively. The system was then gradually heated to 300 K within 20 ps while maintaining a 20 kcal/mol restraint on the protein–ligand complex. Next, a 1 ns isothermal–isobaric (NPT) run and a 1 ns canonical (NVT) run were performed, both with a 5 kcal/mol restraint. Finally, a 20 ns MD run was carried out for equilibration and sampling. All MD simulations used a 2 fs time step, with the temperature maintained by a Berendsen thermostat.
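
The stage lengths above translate into integration step counts as sketched below. The durations and the 2 fs time step are taken from the text; the stage names and the helper function are only illustrative and are not Amber input.

```python
TIME_STEP_FS = 2  # time step stated in the text

# (stage name, duration in picoseconds) taken from the protocol description
STAGES = [
    ("heating to 300 K", 20),
    ("NPT run", 1_000),
    ("NVT run", 1_000),
    ("final MD run", 20_000),
]

def steps_for(duration_ps, time_step_fs=TIME_STEP_FS):
    """Number of integration steps needed to cover duration_ps."""
    return duration_ps * 1_000 // time_step_fs  # 1 ps = 1000 fs

for name, duration_ps in STAGES:
    print(f"{name}: {steps_for(duration_ps):,} steps")
```

The 20 ns run therefore corresponds to ten million integration steps, which dominates the cost of the whole protocol.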

Equilibration of the protein–sucrose complexes was assessed by monitoring the root-mean-square deviation (RMSD) of the ligand and the protein backbone, and reasonable, equilibrated conformations of the ligand were extracted from the MD simulations (Figure 2). The RMSD of the SacB backbone fluctuated around 1.2 Å, indicating that the conformations of sucrose were stable.
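
For reference, the RMSD between two sets of atomic coordinates can be computed as in the minimal sketch below. It assumes the two frames are already superimposed and performs no fitting; the coordinates are invented, and trajectory analysis tools would normally handle the alignment step.

```python
from math import sqrt

def rmsd(coords_a, coords_b):
    """RMSD between two equally sized lists of (x, y, z) coordinates.

    Assumes the frames are already superimposed; no fitting is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and equal in length")
    squared = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return sqrt(squared / len(coords_a))

reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]  # hypothetical reference atoms
frame = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0)]      # every atom shifted by 1 Å
print(rmsd(reference, frame))
```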

img

Fig. 2. RMSD of SacB Wt (A) (PDB ID: 1OYG) and variant S164T (B) using sucrose as ligand. In the mutant, the RMSD fluctuated more strongly, suggesting a lower catalytic efficiency of the mutant.

img

Fig. 3. Binding free energy decomposition calculated by MM/GBSA: van der Waals energy (A), electrostatic energy (B), non-polar solvation energy (C) and polar solvation energy (D). The complexes of SacB and variant S164T with sucrose are shown in blue and orange, respectively.

We then used the MM/GBSA (Molecular Mechanics / Generalized Born Surface Area) approach to decompose the binding free energy (Figure 3). The overall binding free energy of sucrose is -18.84 ± 4.10 kcal/mol. In the mutant, this value is -21.27 ± 3.77 kcal/mol, which does not differ much from that of the wild-type SacB (based on our previous experimental data, an increase in binding energy of 4.5 kcal/mol corresponds to a 2- to 3-fold increase in the inhibition capacity of an inhibitor, which does not correspond directly to catalytic capacity). S164 plays an important role in stabilizing the sucrose molecule in the pocket, with a calculated total free energy contribution of -1.61 kcal/mol. The contribution of this position in the S164T mutant is only -0.41 kcal/mol. Thus, S164T is the key mutation: it directly shifts the sucrose molecule in the catalytic pocket (Figure 4), enhances its interaction with D53 and E214, and diminishes its interaction with E307. The mutation breaks the delicate balance between the catalytic triad and the ligand and is therefore expected to reduce cytotoxicity.
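
The size of these differences can be checked with a few lines of arithmetic. The sketch takes the first reported value as the wild type, as the text implies, and combines the two ± values as if they were independent standard deviations, which is an assumption the source does not confirm.

```python
from math import sqrt

# Binding free energies in kcal/mol, as reported in the text.
wt_total, wt_err = -18.84, 4.10    # assumed to be the wild-type value
mut_total, mut_err = -21.27, 3.77  # S164T mutant

ddg_total = round(mut_total - wt_total, 2)
combined_err = sqrt(wt_err ** 2 + mut_err ** 2)  # assumes independent errors

# Per-residue contribution at position 164, in kcal/mol.
s164_wt, t164_mut = -1.61, -0.41
residue_shift = round(t164_mut - s164_wt, 2)

print(f"total difference: {ddg_total} +/- {combined_err:.2f} kcal/mol")
print(f"position 164 difference: {residue_shift} kcal/mol")
```

The overall difference of -2.43 kcal/mol is smaller than its combined uncertainty, which is consistent with the statement that the total binding energy barely changes, while the contribution at position 164 weakens by 1.20 kcal/mol.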

img

Fig. 4. Comparison between variant S164T and SacB Wt. MD simulations of variant S164T (A) and SacB Wt (B) (PDB ID: 1OYG) using sucrose as ligand. Hydrogen bond parameters of variant S164T (C) and SacB Wt (D). The change in the orientation of D86 partially breaks the hydrogen bond formed between the nucleophile D86 and the 4-OH of the fructose group, which reduces the efficiency of sucrose hydrolysis.

4. References

[1] Case DA, Belfon K, Ben-Shalom IY, Brozell SR, Cerutti DS, Cheatham TE III, Cruzeiro VWD, Darden TA, Duke RE, Giambasu G, Gilson MK, Gohlke H, Goetz AW, Harris R, Izadi S, Izmailov SA, Kasavajhala K, Kovalenko A, Krasny R, Kurtzman T, Lee TS, LeGrand S, Li P, Lin C, Liu J, Luchko T, Luo R, Man V, Merz KM, Miao Y, Mikhailovskii O, Monard G, Nguyen H, Onufriev A, Pan F, Pantano S, Qi R, Roe DR, Roitberg A, Sagui C, Schott-Verdugo S, Shen J, Simmerling CL, Skrynnikov NR, Smith J, Swails J, Walker RC, Wang J, Wilson L, Wolf RM, Wu X, Xiong Y, Xue Y, York DM, Kollman PA. Amber20. University of California, San Francisco. 2020.

[2] Hornak V, Abel R, Okur A, Strockbine B, Roitberg A, Simmerling C. Comparison of multiple Amber force fields and development of improved protein backbone parameters. Proteins. 2006, 65: 712–725.


No.2 Linggong Road, Ganjingzi District, Dalian City, Liaoning Province, P.R.C., 116024
© 2022 - Content on this site is licensed under a Creative Commons Attribution 4.0 International license.
The repository used to create this website is available at gitlab.igem.org/2022/dut-china.