NeuraSyn works by measuring and analysing the change in impedance of aptamers when they bind to a specific protein on the bacterial membrane. The aptamers we used were generated by whole-cell SELEX, the most commonly used and preferred method for developing selective and specific aptamers against a bacterium of interest. However, whole-cell SELEX does not identify the protein to which the aptamer binds. This limits the technique, because the aptamer cannot then be modified in silico to improve its binding. With the receptor unknown, we also cannot use the aptamers as therapeutic agents, nor check whether an aptamer binds proteins of other origins because of structural homology.
We therefore developed KAMI (Kwick Aptamer-based Motif Identification). The software takes as input polypeptide FASTA files for a list of proteins of interest. It also offers the option of generating polypeptides that mimic natural proteins.
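As an illustration of the input step, the sketch below reads FASTA-formatted text into a mapping from record header to sequence. It is a minimal example and not KAMI's own reader; the function name `parse_fasta` and the two example records are hypothetical.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for every record in FASTA-formatted lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped, so append to the current record.
            records[header] += line.upper()
    return records


# Hypothetical two-chain input, used only to show the call.
example = """>chainA hypothetical example
MKTAYIAK
QRQISFVK
>chainB hypothetical example
GSHMLEDP
""".splitlines()

chains = parse_fasta(example)
```

Each value in `chains` is one polypeptide chain, ready to be vectorized separately.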
We analysed proteins and carried out a literature survey to derive rules for generating polypeptides that mimic natural protein fragments.
20 Amino Acids: A R N D C Q E G H I L K M F P S T W Y V
Amino acids with α helix propensity (α pool): A L R M K Q E H
Amino acids with β sheet propensity (β pool): I V T F Y N W C M L S Q
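The generation rules themselves are not listed here, so the following is only a rough sketch of how the two pools could be used: it alternates helix-like and sheet-like segments, drawing each segment's residues from the α pool or the β pool. The alternation scheme, the segment length range (`min_seg`, `max_seg`) and the function name are assumptions made for illustration, not KAMI's actual rules.

```python
import random

# Pools as listed above.
ALPHA_POOL = "ALRMKQEH"
BETA_POOL = "IVTFYNWCMLSQ"


def generate_polypeptide(length: int, rng: random.Random,
                         min_seg: int = 4, max_seg: int = 12) -> str:
    """Build a sequence from alternating alpha-pool and beta-pool segments.

    Segment lengths and strict alternation are illustrative assumptions.
    """
    residues = []
    use_alpha = rng.choice([True, False])  # random starting segment type
    while len(residues) < length:
        pool = ALPHA_POOL if use_alpha else BETA_POOL
        # Never overshoot the requested total length.
        seg_len = min(rng.randint(min_seg, max_seg), length - len(residues))
        residues.extend(rng.choice(pool) for _ in range(seg_len))
        use_alpha = not use_alpha
    return "".join(residues)


rng = random.Random(42)  # fixed seed so the example is reproducible
peptide = generate_polypeptide(50, rng)
```

Because D, G and P appear in neither pool, this simplified generator never emits them; a fuller rule set would need a separate treatment for loop or turn regions.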
PseAAC (pseudo amino acid composition) is a method for vectorizing proteins for further processing. Either a single polypeptide is used, or the chains of many proteins are separated and each chain is vectorized on its own. The resulting vector contains both the amino acid composition (AAC) and the correlation between residues a fixed distance k apart, known as the k-th tier correlation. The vector has a length of 20 + lambda, where lambda is the maximum value of k. Proteins that carry out comparable functions or share structural homology tend to have similar amino acid frequencies and residue correlations.
For every polypeptide of length L, we can write the primary sequence as the vector P = R1 R2 R3 ... RL, where Ri is the residue at position i.
Using the following formulae, we can then convert the primary sequence vector into the numeric vector generated by PseAAC. Here, w is the weight factor, lambda is the maximum value of k, and tau(k) is the k-th tier correlation factor, which carries sequence-order information about residues that are k positions apart.
Tau can be formulated as follows:
Here, phi_q(Ri) is the value of the q-th function for residue Ri, and gamma is the total number of functions under consideration.
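A minimal sketch of the calculation, assuming the commonly used type 1 PseAAC form: tau(k) is the mean over i of Theta(Ri, Ri+k), Theta is the average over the gamma functions of the squared difference phi_q(Ri+k) - phi_q(Ri), the first 20 components are f_u / (sum(f) + w * sum(tau)), and the remaining lambda components are w * tau(k) / (sum(f) + w * sum(tau)). The two property scales (Kyte-Doolittle hydropathy and Hopp-Woods hydrophilicity), their standardization, and the default w = 0.05 are assumptions chosen for illustration; the functions actually used by KAMI are not specified here.

```python
from math import sqrt
from typing import Dict, List

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Assumed property functions (gamma = 2); values in AMINO_ACIDS order.
KYTE_DOOLITTLE = dict(zip(AMINO_ACIDS, [
    1.8, -4.5, -3.5, -3.5, 2.5, -3.5, -3.5, -0.4, -3.2, 4.5,
    3.8, -3.9, 1.9, 2.8, -1.6, -0.8, -0.7, -0.9, -1.3, 4.2]))
HOPP_WOODS = dict(zip(AMINO_ACIDS, [
    -0.5, 3.0, 0.2, 3.0, -1.0, 0.2, 3.0, 0.0, -0.5, -1.8,
    -1.8, 3.0, -1.3, -2.5, 0.0, 0.3, -0.4, -3.4, -2.3, -1.5]))


def standardize(scale: Dict[str, float]) -> Dict[str, float]:
    """Rescale a property to zero mean and unit variance over the 20 residues."""
    mean = sum(scale.values()) / len(scale)
    sd = sqrt(sum((v - mean) ** 2 for v in scale.values()) / len(scale))
    return {aa: (v - mean) / sd for aa, v in scale.items()}


PROPERTIES = [standardize(KYTE_DOOLITTLE), standardize(HOPP_WOODS)]


def theta(a: str, b: str) -> float:
    """Average squared property difference between residues a and b."""
    return sum((p[b] - p[a]) ** 2 for p in PROPERTIES) / len(PROPERTIES)


def pseaac(sequence: str, lam: int, w: float = 0.05) -> List[float]:
    """Return the (20 + lam)-dimensional PseAAC vector of a sequence."""
    length = len(sequence)
    if any(aa not in AMINO_ACIDS for aa in sequence):
        raise ValueError("only the 20 standard residues are supported")
    if lam >= length:
        raise ValueError("lambda must be smaller than the sequence length")
    freqs = [sequence.count(aa) / length for aa in AMINO_ACIDS]
    taus = [
        sum(theta(sequence[i], sequence[i + k]) for i in range(length - k))
        / (length - k)
        for k in range(1, lam + 1)
    ]
    denom = sum(freqs) + w * sum(taus)
    return [f / denom for f in freqs] + [w * t / denom for t in taus]


# Hypothetical 33-residue sequence; lambda = 14 gives a 34-dimensional vector.
vector = pseaac("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", lam=14)
```

With this normalization the components are non-negative and sum to 1, so vectors from chains of different lengths are directly comparable.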
The vectors produced by this method are used to train AI classifiers. Using the same vectorization, we categorised membrane proteins by structural similarity. On this basis we built KAMI and demonstrated that the method can identify the best-binding protein for an aptamer produced by whole-cell SELEX. We further tested the programme by identifying the best-binding proteins for aptamers against E. coli and S. typhimurium.
We first download the FASTA data for all membrane proteins from the RCSB PDB and split them into their constituent chains. Using this file as input, KAMI performs K-means clustering on the polypeptide chains. The cluster centres are then docked against the target aptamer, and the cluster whose centre gives the lowest binding energy is taken to contain the most effective binding protein. That cluster is divided into segments, and randomly chosen sequences from each segment are docked. The segment that shows the best binding is selected and subdivided again, and this is repeated until the best-binding protein is found. The number of clusters and segments is user-defined, although the programme also includes a predictor that recommends a number of clusters.
This significantly reduces the number of docking operations needed to find the best-binding protein, and it identifies the protein an aptamer binds to even though the aptamer was produced by whole-cell SELEX. It opens up the possibility of improving the aptamer's binding through in-silico alterations, as well as repurposing the aptamer.
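The narrowing procedure described above can be sketched as a cluster, dock, keep-the-best loop. The code below is a simplified illustration, not KAMI's implementation: it uses a plain Lloyd's K-means written in standard-library Python, docks the member nearest to each cluster centre at every level (rather than randomly chosen members), and takes the docking step as a caller-supplied function. The names `find_best_binder`, `toy_dock` and `min_size`, and the synthetic 2-D vectors, are hypothetical.

```python
import random
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points: List[Vector]) -> List[float]:
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(points: List[Vector], k: int, rng: random.Random,
           iters: int = 50) -> List[int]:
    """Plain Lloyd's K-means; returns one cluster label per point."""
    centres = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centres[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster empties
                centres[c] = centroid(members)
    return labels


def find_best_binder(vectors: Dict[str, Vector],
                     dock: Callable[[str], float],
                     k: int, rng: random.Random, min_size: int = 3) -> str:
    """Cluster, dock one representative per cluster, recurse into the best.

    `dock` returns a binding energy for a named chain; lower is better.
    """
    candidates = list(vectors)
    while len(candidates) > max(k, min_size):
        labels = kmeans([vectors[n] for n in candidates], k, rng)
        groups: Dict[int, List[str]] = {}
        for name, label in zip(candidates, labels):
            groups.setdefault(label, []).append(name)
        if len(groups) < 2:
            break  # cannot be split any further

        def representative(members: List[str]) -> str:
            centre = centroid([vectors[m] for m in members])
            return min(members, key=lambda m: sq_dist(vectors[m], centre))

        # Keep only the cluster whose representative docks best.
        candidates = min(groups.values(),
                         key=lambda g: dock(representative(g)))
    # Few candidates remain, so dock each of them directly.
    return min(candidates, key=dock)


# Toy demonstration: synthetic vectors and a stand-in "docking" score.
rng = random.Random(0)
vectors = {f"chain_{i}": [rng.uniform(0, 1), rng.uniform(0, 1)]
           for i in range(60)}
target = [0.2, 0.8]
calls: List[str] = []


def toy_dock(name: str) -> float:
    calls.append(name)  # record how many docking runs were needed
    return sq_dist(vectors[name], target)


best = find_best_binder(vectors, toy_dock, k=4, rng=rng)
```

In real use, `dock` would wrap an actual docking run and the vectors would be PseAAC vectors. The search is a heuristic: it trades a guarantee of finding the global optimum for far fewer docking runs than docking every chain.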
General workflow of KAMI
t-SNE plot of generated polypeptides. The lambda value was chosen as 14, so the generated vectors were 34-dimensional. Clustering was done using 10 clusters.
Link to software https://gitlab.igem.org/2022/software-tools/iiser-mohali
Step 1.1: Upload a FASTA file (specify lambda and the number of clusters)
Step 1.2: Generate a random protein (specify length, lambda and the number of clusters)
Step 2: Click Run
Step 3: Download the cluster centres and aptamers, then begin docking.