Investigating involution as a technique for aiding the wet lab
Involution is a novel machine learning technique pioneered by Li et al. (1) that inverts the way traditional convolution processes its data. Convolution processes data in a channel-specific, spatial-agnostic way. This spatial-agnostic nature is considered a strength, as it enables a network to extract and recognize features regardless of where they appear in the input (such as an image). However, it also limits the network's ability to adapt to and recognize features that appear in varied spatial patterns and orientations. In addition, channel specificity lends itself to redundancy in the data being processed, resulting in lengthy processing times. To ameliorate these deficiencies, involution inverts these principles and processes data in a spatial-specific, channel-agnostic way: the network generates weights that account for variation in spatial patterns and positioning, and it reduces redundancy by sharing kernels across channels, so a single kernel may serve several types of features or areas of the input. The result is a network that is faster, better able to process large-scale datasets, and able to learn to prioritize the features it extracts and recognizes based on their spatial orientation. Involution has been validated as faster than and as accurate as convolution in the context of image recognition, but its performance had not been examined in a biological context or on one-dimensional data. We sought to investigate whether involution is an effective neural network technique in these cases.
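To make the spatial-specific, channel-agnostic idea concrete, below is a minimal sketch of a 2D involution layer in Keras, loosely following the pattern of the public Keras involution example. The `Involution` class name, its parameters, and their defaults are our illustration, not Li et al.'s reference implementation: each spatial position generates its own kernel from its input features, and that kernel is shared across groups of channels.

```python
import tensorflow as tf
from tensorflow import keras

class Involution(keras.layers.Layer):
    """Minimal 2D involution sketch: kernels are generated per spatial
    position from the input itself and shared across channel groups."""

    def __init__(self, kernel_size=3, groups=1, reduction=4, **kwargs):
        super().__init__(**kwargs)
        self.kernel_size = kernel_size
        self.groups = groups
        self.reduction = reduction

    def build(self, input_shape):
        channels = int(input_shape[-1])
        # Bottleneck that maps each position's features to a K*K*G kernel.
        self.kernel_gen = keras.Sequential([
            keras.layers.Conv2D(channels // self.reduction, 1, activation="relu"),
            keras.layers.Conv2D(self.kernel_size ** 2 * self.groups, 1),
        ])

    def call(self, x):
        b, h, w = tf.shape(x)[0], tf.shape(x)[1], tf.shape(x)[2]
        c, k, g = x.shape[-1], self.kernel_size, self.groups
        # Generate a kernel for every spatial position: (B, H, W, K*K*G).
        kernels = tf.reshape(self.kernel_gen(x), [b, h, w, k * k, g, 1])
        # Unfold the KxK neighborhood around each position: (B, H, W, K*K*C).
        patches = tf.image.extract_patches(
            x, sizes=[1, k, k, 1], strides=[1, 1, 1, 1],
            rates=[1, 1, 1, 1], padding="SAME")
        patches = tf.reshape(patches, [b, h, w, k * k, g, c // g])
        # Weight each neighbor by the position-specific kernel and sum.
        out = tf.reduce_sum(kernels * patches, axis=3)
        return tf.reshape(out, [b, h, w, c])
```

Note how the kernel depends on the position (spatial-specific) while one kernel per group serves many channels (channel-agnostic), the inverse of convolution's shared-everywhere, per-channel filters.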
We decided that the best way to test the performance of involution on a biological dataset would be to replicate an existing convolutional network. We reached out to Dan Veltri, the first author of "Deep learning improves antimicrobial peptide recognition" (2), which details the development and assessment of a CNN-LSTM for identifying antimicrobial peptides; he also maintains the model's associated frontend, AMP Scanner. Veltri et al.'s model is a hybrid of a CNN and a long short-term memory (LSTM) network that takes FASTA sequences as input and outputs a decimal predicting whether a given sequence is antimicrobial. The network uses convolution, which identifies and maps out features, together with an LSTM, a type of neural network that works well on sequential data. LSTMs process sequences such as videos, audio clips, and handwriting segments, which makes them useful for FASTA sequences: they can recognize instances where the order and positioning of segments of the data are important features and incorporate that information into their predictions. LSTMs work by processing data and choosing either to forget it or to store it in "long-term" memory, learning what is important to retain and look for as they make their way down long sequences of data; they then pass on what they deem important to make their predictions. The network is highly accurate, reaching 87.80% accuracy prior to hyper-parameter tuning and up to 91.01% following tuning. Dr. Veltri graciously provided us with the code and models used for AMP Scanner, enabling us to start our replication.
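For orientation, here is a sketch of this CNN-LSTM architecture class in Keras. The layer sizes below are illustrative placeholders rather than Veltri et al.'s exact published hyperparameters: an embedding turns residues into vectors, a 1D convolution extracts local motif features, and the LSTM carries order and position information to a sigmoid output.

```python
from tensorflow import keras

MAX_LEN = 200  # fixed peptide length after padding (illustrative)
VOCAB = 21     # 20 amino acids plus a padding index

inputs = keras.Input(shape=(MAX_LEN,))
h = keras.layers.Embedding(VOCAB, 128)(inputs)                          # residue embeddings
h = keras.layers.Conv1D(64, 16, padding="same", activation="relu")(h)   # local motif features
h = keras.layers.MaxPooling1D(5)(h)
h = keras.layers.LSTM(100)(h)                                           # order/position context
outputs = keras.layers.Dense(1, activation="sigmoid")(h)                # P(antimicrobial)

cnn_lstm = keras.Model(inputs, outputs)
cnn_lstm.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```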
We chose to use the Keras involution (INN) framework and expected the conversion from the CNN to be simple, as Veltri et al. (2) had also used Keras. However, we quickly realized that the replication was not going to be as simple as swapping out one piece of architecture for another. Veltri et al. (2) used TensorFlow 1.2.1, and the original model and its supporting packages were written for Python 2, which is now deprecated. To run the model from the paper on our own devices, a prerequisite to modifying it into an INN, we needed either to migrate the code to newer software versions or to set up a virtual machine or dedicated computer to run the older versions. As support for Python 2 has ended, and setting up and maintaining a virtual machine or device would be difficult and time-intensive, we decided to migrate the code to Python 3. Once we had done so, we ran the model and verified that its outputs and accuracy rates were consistent with those reported in the paper. Next, we set about converting the convolution in the network to involution, leaving the LSTM unchanged to give a more accurate picture of how involution performed against convolution within the model. We then ran into another roadblock: the Keras involution framework was written to take two-dimensional data (such as an image) as input, whereas Veltri et al.'s model was one-dimensional, matching the dimensionality of a FASTA sequence. As involution is a new concept, no involutional framework exists for one-dimensional input data. We solved this by reshaping the data so that it could be fed into the two-dimensional framework. After modifying the network to take one-dimensional FASTA sequences as input, we trained the model, tuned its hyperparameters, and evaluated its accuracy using the same datasets as Veltri et al.
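In sketch form, the reshaping workaround looks like the following: give the sequence tensor a dummy width of 1 so the 2D involution layer accepts it, then squeeze that axis back out. `Involution` here is the 2D layer sketched earlier (not a Keras built-in), and the sizes reuse the illustrative values from the architecture sketch above, not our exact configuration.

```python
from tensorflow import keras

MAX_LEN, VOCAB = 200, 21  # as in the earlier sketch

def involution_1d(x, kernel_size=3, groups=1):
    """Run a 2D involution layer over 1D sequence features via a dummy axis."""
    length, channels = x.shape[1], x.shape[2]
    x = keras.layers.Reshape((length, 1, channels))(x)    # (B, L, C) -> (B, L, 1, C)
    x = Involution(kernel_size=kernel_size, groups=groups)(x)
    return keras.layers.Reshape((length, channels))(x)    # back to (B, L, C)

# The involution block drops in where the Conv1D sat; the LSTM is untouched.
inputs = keras.Input(shape=(MAX_LEN,))
h = keras.layers.Embedding(VOCAB, 128)(inputs)
h = involution_1d(h)
h = keras.layers.MaxPooling1D(5)(h)
h = keras.layers.LSTM(100)(h)
outputs = keras.layers.Dense(1, activation="sigmoid")(h)
inn_lstm = keras.Model(inputs, outputs)
```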
The INN-LSTM performed comparably to the CNN-LSTM. After training, tuning, and evaluation on the test datasets, the INN-LSTM achieved an accuracy of 90.59%, compared to 91.01% for the CNN-LSTM. It also scored similarly on sensitivity and specificity.
Figure 1. Comparison of sensitivity, specificity, and accuracy between INN and CNN.
The INN-LSTM scored well on the Matthews correlation coefficient (MCC), a statistic that rates a network based on how many type I and type II errors it makes relative to correct predictions (3): it scored 0.813 on a scale from a minimum of -1 to a maximum of 1. The network performed exceptionally on the area under the receiver operating characteristic curve (AUROC) (4), which scores a model's ability to discriminate between cases (antimicrobial) and non-cases (non-antimicrobial); an AUROC of 0.5 indicates a classifier that performs no better than random selection, while 1.0 indicates a perfect classifier. The INN-LSTM scored 0.96.
Figure 2. Comparison of MCC and AUROC between INN and CNN.
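Both metrics are standard and easy to reproduce; a minimal sketch using scikit-learn (the `y_true`/`y_prob` arrays below are toy placeholders, in practice they come from the held-out test set):

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef, roc_auc_score

# Toy placeholders standing in for test-set labels and model outputs.
y_true = np.array([1, 1, 0, 0, 1, 0])               # 1 = antimicrobial
y_prob = np.array([0.9, 0.7, 0.2, 0.4, 0.6, 0.1])   # model sigmoid outputs

y_pred = (y_prob >= 0.5).astype(int)                # 0.5 decision threshold

print(matthews_corrcoef(y_true, y_pred))  # -1 .. 1; penalizes both error types
print(roc_auc_score(y_true, y_prob))      # 0.5 = random, 1.0 = perfect
```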
When assessing candidate peptides to functionalize our bacterial cellulose, we realized that many of the peptide fragments we intended to recombinantly express and insert had not been experimentally validated as antimicrobial. After narrowing down several options that met our other criteria (food safe, thermostable), we ran the FASTA sequence of the nisin fragment we wished to insert through both the Veltri et al. CNN-LSTM and our INN-LSTM.
Figure 3. Prediction of the antimicrobial properties of our intended nisin fragment, given its FASTA sequence. A prediction of 0.5 or greater indicates predicted antimicrobial properties.
Both returned the prediction that it was indeed antimicrobial. Once the fragment was successfully cloned and inserted, Kirby-Bauer tests experimentally validated that the inserted fragment was antimicrobial, in alignment with the neural network predictions.
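Screening a new candidate is a short script. The sketch below shows one way to encode a peptide and query a trained model; the model filename and the peptide string are placeholders, not our actual artifacts, and the encoding scheme (integer indices, pre-padding) is an assumption that must match how the model was trained.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras.preprocessing.sequence import pad_sequences

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 0 reserved for padding

def encode(peptide, max_len=200):
    """Integer-encode a peptide string and pad it to the model's input length."""
    ids = [AA_INDEX[aa] for aa in peptide.upper()]
    return pad_sequences([ids], maxlen=max_len)

model = keras.models.load_model("inn_lstm.h5")          # placeholder filename
prob = float(model.predict(encode("ACDEFGHIKLMNPQRSTVWY"))[0, 0])  # placeholder sequence
print(f"{prob:.3f} ->", "antimicrobial" if prob >= 0.5 else "non-antimicrobial")
```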
All of the code used in this tool is hosted on GitLab, and is accessible here!
In the future, it would be valuable to compare computational performance metrics, such as computation time and memory usage, between the CNN-LSTM and the INN-LSTM to validate the increased computational performance claimed for involution. Veltri et al. did not publish such metrics, so we were not able to make these comparisons; given time to collect the data under otherwise identical conditions for both models, we would be able to investigate this. As the INN can make a prediction about any FASTA sequence, it can inform future iterations of Cellucoat by providing insight into other potential inserts, which can then be combined with Golden Gate assembly to modularize Cellucoat so it can be functionalized with peptides specific to the pathogens it is meant to combat.
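Collecting those numbers would be straightforward once both models are trained in the same environment. A minimal sketch of such a head-to-head timing, with a helper name and setup of our own devising; the key requirement is identical hardware, data splits, and batch size for both models:

```python
import time

def benchmark(model, x_train, y_train, x_test, epochs=1, batch_size=32):
    """Wall-clock timing for one model; call on both the CNN-LSTM and
    INN-LSTM with identical data and settings so results are comparable."""
    t0 = time.perf_counter()
    model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size, verbose=0)
    train_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    model.predict(x_test, batch_size=batch_size, verbose=0)
    infer_s = time.perf_counter() - t0

    return {"params": model.count_params(),
            "train_s": round(train_s, 2),
            "infer_s": round(infer_s, 2)}

# e.g. compare benchmark(cnn_lstm, ...) against benchmark(inn_lstm, ...)
```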