
Software

Overview

To support our experimental verification, we designed a sequence comparison software, kmer2vec, and a promoter strength prediction software, Promoter_Transformer. kmer2vec was applied in our project to perform multiple sequence comparison and examine the application scope of our engineered bacteria detection system in different species. Promoter_Transformer was employed to conduct computer-based directed evolution of the PnisA promoter through promoter strength prediction. Both software tools have been demonstrated to be accurate and efficient, and have a broad application range in fields beyond our project.

kmer2vec

Introduction

Researchers or animal breeders may be interested in applying our sperm quality detection system to other species. To quickly judge the application scope of our engineered bacteria system across species, multiple sequence comparison can be applied: the Sp10/EGFR sequences of different species are compared with the Sp10/EGFR sequences we used to obtain a distance matrix. If the distance is less than a certain threshold, the corresponding sequence is considered likely to be applicable to our system. In the process of solving this problem, we can also innovate on the algorithms for multiple sequence comparison.

Traditional multiple sequence alignment (MSA) methods, such as Clustal W [1], multiple sequence alignment based on fast Fourier transform (MAFFT) [2] and multiple sequence comparison by log-expectation (MUSCLE) [3], are generally accurate in sequence comparison. However, MSA faces a computational challenge when analyzing large datasets of long sequences. In addition, MSA is based on the hypothesis of homology and cannot well reflect the effects of gene rearrangement, gene duplication, horizontal gene transfer and other events [4]. Alignment-free methods transform biological sequences into numerical representations and then analyze them quantitatively, significantly improving the speed of sequence comparison and interpretation. Therefore, we set out to design an alignment-free method to increase the speed of sequence comparison.

Here, we propose an alignment-free method (called kmer2vec) based on word2vec embedding of k-mers in DNA sequences, focusing on phylogenetic analysis and species clustering. Combining the k-mer theory [5] and the word2vec method, the kmer2vec can truly reflect the original information and capture the k-mer contexts in the sequences. The kmer2vec has been demonstrated to be a useful tool for biology sequence comparison, thus accomplishing our goal of determining the application range of our detection system across species.


Fig 1. Our published paper.

The project code can be found on our website. Details on method construction can be found in our published paper "kmer2vec: A Novel Method for Comparing DNA Sequences by word2vec Embedding" [6].

Methods

word2vec word embedding

The word2vec method, proposed by Google [7], embeds words into meaningful high-dimensional numerical vectors. Its basic principle is that words with similar contexts have similar semantics. By training a shallow neural network on large text corpora, the word2vec method generates word embedding vectors in the hidden layer of the network. The network architecture is either the continuous bag-of-words (CBOW) model or the skip-gram model: during training, CBOW predicts a word from its context words, while skip-gram predicts the context words from a given word. The vector representation of each word is shaped by its context vocabulary, so if two words share similar contexts, their word vectors will also be similar. The word2vec method can therefore assign each word a vector that reflects its meaning (Fig. 2).


Fig 2. Word2vec helps to find the suitable vectors of words.

kmer2vec flowchart

We propose the following overall construction process of the genome comparison method kmer2vec (Fig. 3).

  1. Divide the training DNA sequences into a series of k-mers words.
  2. Use the word2vec method to train the k-mer word2vec vectors.
  3. Divide the sequences under study into a series of k-mer words; if a k-mer of a sequence under study does not appear in the training dataset, its vector is set to all zeros.
  4. Take the average and variance of all the k-mer word vectors of each sequence to form the sequence vector.
  5. Calculate the pairwise distance of the corresponding sequence vectors to quantify the difference among the sequences.
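Steps 1 and 3-5 above can be sketched in a short, self-contained Python example. Note that the k-mer vector table below is a toy stand-in for the vectors that word2vec training (step 2) would produce; the names and values are hypothetical and not taken from our released code.

```python
import math
from itertools import combinations

K = 3  # k-mer length; the actual method explores several values of k

def segment(seq: str, k: int = K) -> list[str]:
    """Steps 1/3: slide a window of length k over the sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Toy stand-in for the trained word2vec table (step 2); real vectors
# come from training word2vec on the segmented sequence corpus.
KMER_VECS = {
    "ATG": [0.9, 0.1], "TGC": [0.8, 0.2], "GCA": [0.7, 0.3],
    "CAT": [0.2, 0.8], "ATT": [0.1, 0.9],
}

def seq_vector(seq: str) -> list[float]:
    """Step 4: mean and variance of the k-mer vectors, concatenated.
    Unseen k-mers get the zero vector, as required by step 3."""
    dim = 2
    vecs = [KMER_VECS.get(km, [0.0] * dim) for km in segment(seq)]
    means = [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
    variances = [sum((v[d] - means[d]) ** 2 for v in vecs) / len(vecs)
                 for d in range(dim)]
    return means + variances

def distance_matrix(seqs: list[str]) -> dict:
    """Step 5: pairwise Euclidean distance between sequence vectors."""
    vecs = {s: seq_vector(s) for s in seqs}
    return {(a, b): math.dist(vecs[a], vecs[b])
            for a, b in combinations(seqs, 2)}

dm = distance_matrix(["ATGCAT", "ATGCA", "CATTT"])
```

In this toy example, the two sequences sharing most of their k-mers end up closer in the distance matrix than the unrelated one, which is the property the phylogenetic analysis relies on.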

Fig 3. The overall construction process of the kmer2vec method.

How to use kmer2vec


Here we show how to use the kmer2vec tool.
Step 1 (Input dataset):
Organize open data from NCBI or your own sequencing data into a dataset in FASTA format.

Step 2 (Use kmer2vec):
Run the kmer2vec main program to process the input dataset, and then kmer2vec will return three files. File 1 is the result of segmenting the input dataset, File 2 is the saved word2vec neural network model, and File 3 is the distance matrix generated by the multiple sequence comparison.

Step 3 (Visualization):
The generated distance matrix file can be input into MEGA X [8] to make a dendrogram reflecting the distance relationship.

Results

Phylogenetic tree of 16 coronaviruses and 11 sperm protein 10 sequences

To test the validity of kmer2vec, we performed phylogenetic analysis on a dataset of 16 coronavirus whole-genome sequences. After obtaining the distance matrix, we used the Unweighted Pair-Group Method with Arithmetic Mean (UPGMA) [9] to build the phylogenetic tree (Fig. 4A). The results demonstrated that our alignment-free kmer2vec method can reproduce the phylogenetic trees built by MSA methods, conforming to the biological properties of the coronaviruses.
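The UPGMA tree-building step can be sketched as follows. This is a minimal illustration of size-weighted average-linkage merging on a toy three-leaf distance matrix, not the MEGA X implementation actually used for the figures.

```python
from itertools import combinations

def upgma(labels, dist):
    """Minimal UPGMA sketch: repeatedly merge the closest pair of
    clusters; distances to the merged cluster are size-weighted
    averages of the distances to its members."""
    sizes = {lab: 1 for lab in labels}              # cluster -> leaf count
    d = {frozenset(k): v for k, v in dist.items()}  # symmetric distances
    while len(sizes) > 1:
        a, b = min(combinations(list(sizes), 2),
                   key=lambda p: d[frozenset(p)])
        merged, n = (a, b), sizes[a] + sizes[b]
        for c in sizes:
            if c not in (a, b):
                d[frozenset((merged, c))] = (
                    d[frozenset((a, c))] * sizes[a]
                    + d[frozenset((b, c))] * sizes[b]) / n
        del sizes[a], sizes[b]
        sizes[merged] = n
    return next(iter(sizes))  # nested tuples encode the tree topology

tree = upgma(["A", "B", "C"],
             {("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0})
```

Here A and B (distance 2) are merged first, so the resulting topology groups them before joining C, mirroring how the dendrogram is read.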

After validating the feasibility of our software, we applied our software to Sp10 protein sequence comparison across species. We downloaded Sp10 protein sequences of different species and performed sequence comparison analysis to draw a dendrogram. As shown in Fig. 4B, kmer2vec well reflected the degree of similarity between sequences across species. For example, the SP-10 sequences belonging to primates were all in the same branch.


Fig 4. The phylogenetic trees of the 16 coronaviruses (A) and SP-10 sequences (B) using kmer2vec.

Clustering results of bacteria and influenza viruses

To test the clustering effect and verify the speed advantage of our software, we input a larger bacterial dataset for cluster analysis and obtained satisfactory clustering results within a relatively short computing time. As shown in Fig. 5A, dots of the same color indicate bacteria belonging to the same taxon. Since dots of the same color all clustered together, our software demonstrated a good clustering effect.

In addition, when working with large datasets, the speed advantage of kmer2vec over MSA becomes apparent. The test finished within one hour using kmer2vec (with the 'epoch' parameter of the word2vec model set to 1), whereas it could not be finished within one day using the MSA method on the same computing platform.

As for the influenza viruses, we used the neuraminidase (NA) gene dataset to test the clustering effect of kmer2vec again. From the results in Fig. 5B, the clustering effect of kmer2vec was again satisfactory, with all species of the same taxon clustered together.


Fig 5. The clustering results of bacteria (A) and influenza viruses (B) using kmer2vec.

Conclusion

In summary, we developed an alignment-free method, kmer2vec, for rapid multiple sequence comparison, focusing on phylogenetic analysis and species clustering. This method demonstrated good accuracy and a great speed advantage on multiple datasets, and can be used in our project to quickly assess the applicability of our system.

Our algorithm has the potential for a wide range of application scenarios when evaluated from the following aspects.

  • Usefulness to other projects
    Our software can be very useful for other projects. For example, our kmer2vec approach can come in handy if a project requires large multiple sequence alignments. Our method runs more than 10,000 times faster than traditional alignment methods, saving computing resources and shortening project cycles significantly.
  • Incorporation into new workflows
    Our software can be easily embedded in new workflows. In fact, the function of our main program is to convert data in FASTA format into a distance matrix between sequences. This distance matrix can be easily fed into tree-building software such as MEGA X, producing clear tree diagrams. Since the distance relationships are presented in matrix form, they can be embedded in various new workflows through mathematical operations and transformations.
  • User-friendliness
    Our software kmer2vec is user-friendly and well documented. We uploaded all the code and data on GitLab along with a detailed user manual.

Fig 6. Advantages of kmer2vec.

Promoter_Transformer

Introduction

The promoter is one of the core components in synthetic biology; it controls the strength of transcription and thus the strength of gene expression [10]. Therefore, finding a promoter with suitable strength is crucial for constructing an expression vector. In our project, the promoters used for the two-component system were too weak to produce a fluorescence signal observable by the naked eye. Therefore, we intended to perform computer-based directed evolution to screen for stronger promoters for the two-component system. To achieve this goal, we developed the Promoter_Transformer software to predict promoter strength from input sequences, facilitating the computationally assisted construction of promoters of different strengths.

There are some related studies on models of prokaryotic promoter strength prediction, but most of them are classification models used to predict whether a promoter is strong or weak [11]. In an article published in Nucleic Acids Research, the authors combined a deep generative model that guided the search for artificial sequences with a predictive model to preselect the most promising promoters [12]. The dataset used for their predictive model consisted of E. coli promoter sequences and the expression levels of their corresponding genes [13]. Since prokaryotes lack many of the regulatory elements found in eukaryotes, such as enhancers, the expression level of the corresponding gene can be indirectly regarded as the strength of the promoter.

In our study, we mainly focused on the construction of the predictive model that can be applied to our promoter directed evolution. We used the dataset of E. coli promoter sequences and their corresponding gene expression levels, and modified the predictive model based on Transformer to make it applicable to our biological problems. It was demonstrated that our model outperformed the model in the Nucleic Acids Research article in predictive performance [12]. The project code can be found on our website.

Methods

Transformer

Transformer was proposed by Google in the paper "Attention is all you need" [14] and was originally applied to machine translation problems. It has since made great progress and been studied in more depth. Transformer relies on self-attention, a method widely used in deep learning to improve machine translation, and allows higher parallelism than traditional Recurrent Neural Networks (RNNs). Its overall structure consists of encoders and decoders (Fig. 7). In each encoder, the data first passes through a Multi-Head Attention layer and then through a Feed Forward layer. An attention function maps a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The mechanism of the special attention "Scaled Dot-Product Attention" used in the Transformer model can be seen in Fig. 8. Additionally, there is a residual connection [15] around each of the two sub-layers, followed by layer normalization [16]. The structure of the decoder is generally similar to that of the encoder, and the decoder is mainly used to generate the output.
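The "Scaled Dot-Product Attention" function, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, can be illustrated with a minimal pure-Python sketch on toy 2x2 matrices; real models use optimized framework implementations.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    # scores[i][j] = dot(Q[i], K[j]) / sqrt(d_k)
    scores = [[sum(q * k for q, k in zip(qrow, krow)) / math.sqrt(d_k)
               for krow in K] for qrow in Q]
    weights = [softmax(row) for row in scores]
    # out = weights @ V
    out = [[sum(w * v for w, v in zip(wrow, vcol))
            for vcol in zip(*V)] for wrow in weights]
    return weights, out

# Toy example: each query attends most strongly to its matching key.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
W, out = attention(Q, K, V)
```

Each row of the attention weights sums to 1, so the output is a convex combination of the value vectors, weighted by query-key similarity.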


Fig 7. The overall structure of Transformer.

Fig 8. The mechanism of the special attention "Scaled Dot-Product Attention" used in Transformer.

Promoter_Transformer

Based on the Transformer's structure, we constructed a model that can be used to predict promoter strength, called Promoter_Transformer. Promoter_Transformer treats the promoter sequence as text, performs word segmentation through the k-mer method, and then extracts the base information of the promoter sequence for training. Since our task is essentially a regression task, unlike machine translation, we mainly used the Transformer encoder to extract information. After word segmentation, we built a vocabulary and input the sentences into the embedding layer. Positional encoding is optional; its formula is shown below.

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

The tensors were then fed into the Transformer encoder layer, after which they were flattened, and the prediction was obtained through a fully connected layer. Between the flattening layer and the linear layer, we used the Dropout method to effectively prevent overfitting. The overall structure of Promoter_Transformer can be seen in Fig. 9. A more detailed explanation of the model building can be found on the Modeling page.
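The sinusoidal positional encoding from the original Transformer paper [14] can be computed directly; a minimal sketch (the table sizes here are illustrative, not the model's actual hyperparameters):

```python
import math

def positional_encoding(max_len: int, d_model: int) -> list[list[float]]:
    """PE[pos][2i]   = sin(pos / 10000^(2i / d_model))
       PE[pos][2i+1] = cos(pos / 10000^(2i / d_model))"""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

# One encoding per position; added element-wise to the token embeddings.
pe = positional_encoding(max_len=50, d_model=8)
```

Because the wavelengths form a geometric progression across dimensions, each position receives a unique pattern, letting the encoder distinguish base order without recurrence.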


Fig 9. The overall structure of Promoter_Transformer.

Directed evolution of promoters

Using Promoter_Transformer to perform promoter directed evolution requires the following two steps.
Step 1 (Random mutation):
First, we choose the target promoter and its evolution direction. For example, in our project we intended to evolve the PnisA promoter in the nisin pathway toward greater strength. We then used a computer to randomly mutate the sequence of this promoter; the number of mutated bases can be adjusted. This yields a large number of mutated sequences derived from the original promoter sequence. This step can be conducted with the mutation.py file in our code.
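The random-mutation step can be sketched as below. The 50-nt sequence here is a placeholder, not the actual PnisA sequence, and mutation.py may differ in detail.

```python
import random

def mutate(seq: str, max_mut: int) -> str:
    """Randomly substitute between 1 and max_mut bases of the sequence."""
    bases = "ACGT"
    seq = list(seq)
    n_mut = random.randint(1, max_mut)
    for pos in random.sample(range(len(seq)), n_mut):
        # always substitute with a *different* base
        seq[pos] = random.choice([b for b in bases if b != seq[pos]])
    return "".join(seq)

random.seed(0)
promoter = "TTGACA" * 8 + "TA"   # placeholder 50-nt promoter sequence
mutants = {mutate(promoter, max_mut=5) for _ in range(1000)}
```

Collecting mutants into a set deduplicates them, so repeating the draw simply enlarges the candidate pool up to the desired size (50,000 in our run).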

Step 2 (Strength screening):
We input the large number of mutant sequences into the Promoter_Transformer model and selected the top 20 sequences with the highest predicted strength based on the model's output. In this way, we completed the directed evolution of the promoter in silico, and the selected sequences can then be tested experimentally. This step can be conducted with the Promoter_Transformer.py file in our code.
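The screening step reduces to ranking candidates by predicted strength. In this sketch, predict_strength is a hypothetical stand-in (a toy GC-content score) for the trained Promoter_Transformer model:

```python
def predict_strength(seq: str) -> float:
    """Stand-in scorer for the trained model: toy GC content,
    used purely for illustration."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def screen(mutants: list[str], top_n: int = 20) -> list[str]:
    """Rank candidate promoters by predicted strength, keep the top N."""
    return sorted(mutants, key=predict_strength, reverse=True)[:top_n]

candidates = ["ATGCGC", "ATATAT", "GGGCCC", "ATGGCA", "TTTTTT"]
best = screen(candidates, top_n=2)
```

Swapping the toy scorer for the real model's prediction function turns this into the full in-silico screening loop.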

Results

The predictive performance of Promoter_Transformer

During model building, we employed the Pearson correlation coefficient to assess the correlation between predicted and actual results. We used the same dataset as Wang et al. [12] and performed similar splits. Wang et al. mainly used a Convolutional Neural Network (CNN) model [17] for prediction, reaching a correlation of 0.25. In our project, we also tried multiple model architectures, obtaining correlations around 0.14 with a Recurrent Neural Network (RNN) [18] and around 0.22 with a Long Short-Term Memory (LSTM) network [19]. Ultimately, we refined the Transformer-based framework to fit our biological problem and achieved a correlation above 0.32, indicating that our prediction results reached an advanced level under this evaluation standard.
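The Pearson correlation coefficient used for this evaluation can be computed as follows (a self-contained sketch; the sample values are illustrative):

```python
import math

def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equal-length lists:
    covariance divided by the product of standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Illustrative predicted vs. measured strengths
r = pearson([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

A value of 1 means perfectly linear agreement between predicted and measured strengths; the 0.25 and 0.32 figures above are this coefficient on the held-out split.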

The promoter directed evolution results with experimental verification

With the prediction model constructed and trained, we applied it to the directed evolution of the PnisA promoter. In principle our model places no requirement on sequence length; however, since the sequences in our training dataset were all the 50 nt before the transcription start site, we likewise used the 50 nt at the corresponding position of PnisA for prediction. We first mutated it, with at most 5 mutated bases per sequence, ending up with 50,000 mutant sequences. We then input the mutant sequences into the trained prediction model to obtain the corresponding predicted strengths, and finally selected the 10 sequences with the highest predicted strength as candidate optimized sequences.

Experiments were then performed to validate our directed evolution results. Briefly, we introduced these 10 promoters into our plasmid pNisin2, transformed them into the MACH1-T1 strain, and detected EGFP expression with and without nisin induction using a microplate reader. The results indicated that fluorescence signals were enhanced both with and without nisin induction, with 80% of the promoters (all except PnisRRH07 and PnisRRH10) demonstrating successful improvement (Fig. 10).


Fig 10. The EGFP expression level detected by microplate reader.

Furthermore, flow cytometry confirmed that the percentage of fluorescence-positive cells increased for 90% of the mutated promoters, indicating that our software was effective in enhancing promoter strength (Fig. 11).


Fig 11. The percentages of positive fluorescence signal bacteria detected by flow cytometry.

Conclusion

Overall, we have built a new deep-learning model that can predict the strength of prokaryotic promoters. On the same dataset, the predictive performance of our model exceeded that of the model by Wang et al. [12] by about 30%. Additionally, our software was validated by promoter mutation experiments with about 90% effectiveness. In this project, the model was used for the directed evolution of promoters for the two-component system. In the future, this model and software may have a wider range of applications, such as screening for promoters of suitable strength.


Fig 12. The effect of Promoter_Transformer.

Our algorithm has the potential for a wide range of application scenarios when evaluated from the following aspects.

  • Supporting existing synthetic biology standards
    Our software supports existing synthetic biology standards because promoters are one of the core elements in synthetic biology. Finding a promoter with suitable strength is crucial for constructing an expression vector. With our Promoter_Transformer, we can achieve directed evolution of different promoters. It is worth noting that the direction of our directed evolution is not fixed, but can be changed according to research purposes. Therefore, our software is very useful for the construction of promoters with different strengths.
  • Experimental validation
    The results of our software have been validated by experimental work. We mutated the PnisA promoter to obtain 50,000 mutated sequences, input them into the trained prediction model to predict the corresponding promoter strengths, selected the top 10 promoters with the highest predicted strength, and verified them experimentally.
  • Usefulness to other projects
    Our software can be very useful for other projects. It can be applied to predicting promoter strength before experiments, performing directed evolution with a small number of mutations in the promoter sequences, and screening for promoters with suitable strength. Our software has more obvious advantages when the data volume is larger.
  • Incorporation into new workflows
    Our software Promoter_Transformer can be easily embedded in new workflows. The corresponding promoter strength value can be obtained after inputting a promoter sequence; therefore, it can be easily incorporated into various workflows such as directed evolution.
  • User-friendliness
    Our software Promoter_Transformer is user-friendly and well documented. We uploaded all the code and data on GitLab along with a detailed user manual.

References

[1] Thompson, J. D., Higgins, D. G., & Gibson, T. J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic acids research, 22(22), 4673–4680. https://doi.org/10.1093/nar/22.22.4673
[2] Katoh, K., Misawa, K., Kuma, K., & Miyata, T. (2002). MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic acids research, 30(14), 3059–3066. https://doi.org/10.1093/nar/gkf436
[3] Edgar R. C. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic acids research, 32(5), 1792–1797. https://doi.org/10.1093/nar/gkh340
[4] Zielezinski, A., Vinga, S., Almeida, J., & Karlowski, W. M. (2017). Alignment-free sequence comparison: benefits, applications, and tools. Genome biology, 18(1), 186. https://doi.org/10.1186/s13059-017-1319-7
[5] Blaisdell B. E. (1986). A measure of the similarity of sets of sequences not requiring sequence alignment. Proceedings of the National Academy of Sciences of the United States of America, 83(14), 5155–5159. https://doi.org/10.1073/pnas.83.14.5155
[6] Ren, R., Yin, C., & S-T Yau, S. (2022). kmer2vec: A Novel Method for Comparing DNA Sequences by word2vec Embedding. Journal of computational biology : a journal of computational molecular cell biology, 10.1089/cmb.2021.0536. Advance online publication. https://doi.org/10.1089/cmb.2021.0536
[7] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
[8] Kumar, S., Stecher, G., Li, M., Knyaz, C., & Tamura, K. (2018). MEGA X: Molecular Evolutionary Genetics Analysis across Computing Platforms. Molecular biology and evolution, 35(6), 1547–1549. https://doi.org/10.1093/molbev/msy096
[9] Sokal, R. R. (1958). A statistical method for evaluating systematic relationships. Univ. Kansas, Sci. Bull., 38, 1409-1438.
[10] Smale, S. T., & Kadonaga, J. T. (2003). The RNA polymerase II core promoter. Annual review of biochemistry, 72(1), 449-479.
[11] Xiao, X., Xu, Z. C., Qiu, W. R., Wang, P., Ge, H. T., & Chou, K. C. (2019). iPSW (2L)-PseKNC: a two-layer predictor for identifying promoters and their strength by hybrid features via pseudo K-tuple nucleotide composition. Genomics, 111(6), 1785-1793.
[12] Wang, Y., Wang, H., Wei, L., Li, S., Liu, L., & Wang, X. (2020). Synthetic promoter design in Escherichia coli based on a deep generative network. Nucleic Acids Research, 48(12), 6403-6412.
[13] Thomason, M. K., Bischler, T., Eisenbart, S. K., Förstner, K. U., Zhang, A., Herbig, A., ... & Storz, G. (2015). Global transcriptional start site mapping using differential RNA sequencing reveals novel antisense RNAs in Escherichia coli. Journal of bacteriology, 197(1), 18-28.
[14] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
[15] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
[16] Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
[17] O'Shea, K., & Nash, R. (2015). An introduction to convolutional neural networks. arXiv preprint arXiv:1511.08458.
[18] Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
[19] Sundermeyer, M., Schlüter, R., & Ney, H. (2012). LSTM neural networks for language modeling. In Thirteenth annual conference of the international speech communication association.