Results

Wetlab


Sequencing results confirmed that saturation mutagenesis was successful. We created a library of mutants with randomized sequences while keeping the -35 and -10 boxes (TTGACA and TATAAT) untouched, as shown below.
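As a rough illustration of this library design, the sketch below builds promoter variants with randomized context around the two fixed boxes and checks that a read still carries both boxes. The region lengths (upstream, spacer, downstream) and the function names are assumptions made for the example, not the lengths used in our construct.

```python
import random

MINUS_35 = "TTGACA"  # -35 box, kept fixed (from the text)
MINUS_10 = "TATAAT"  # -10 box, kept fixed (from the text)


def _random_bases(rng, n):
    """Return n random nucleotides."""
    return "".join(rng.choice("ACGT") for _ in range(n))


def random_promoter(rng, upstream=10, spacer=17, downstream=6):
    """Return one library member: random context around the two fixed boxes.

    The region lengths are illustrative assumptions.
    """
    return (_random_bases(rng, upstream) + MINUS_35
            + _random_bases(rng, spacer) + MINUS_10
            + _random_bases(rng, downstream))


def boxes_intact(seq, upstream=10, spacer=17):
    """Check that both boxes sit at their expected positions in a read."""
    start_35 = upstream
    start_10 = upstream + len(MINUS_35) + spacer
    return (seq[start_35:start_35 + len(MINUS_35)] == MINUS_35
            and seq[start_10:start_10 + len(MINUS_10)] == MINUS_10)


rng = random.Random(0)  # fixed seed so the example is reproducible
library = [random_promoter(rng) for _ in range(5)]
```

Running `boxes_intact` over sequencing reads gives a quick way to flag clones in which a box was accidentally mutated.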

To characterize promoter strengths, we retransformed the successfully cloned plasmids, plated the transformants, and picked four colonies per plate for scale-up liquid culturing. Reporter signal was measured after 20 hours of culturing: GFP emission was measured at 525 nm after excitation at 483 nm. All data were normalized as the ratio of relative fluorescence units (RFU) to cell density (OD600). To show how much signal strength varies when saturation mutagenesis is applied only to the sequence context surrounding the consensus polymerase-binding boxes of the promoter, we then expressed the signal of each mutant as a fold change relative to the white, untransformed Mach1 cells, which served as the wild-type reference.
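The two normalization steps can be sketched as follows. The readings are hypothetical placeholders, and averaging the four colonies before taking the ratio is an assumption about the order of operations.

```python
from statistics import mean


def normalize(rfu, od600):
    """Fluorescence per unit cell density (RFU / OD600)."""
    return rfu / od600


def fold_change(sample_wells, control_wells):
    """Mean normalized signal of a sample relative to the control wells."""
    sample = mean(normalize(r, o) for r, o in sample_wells)
    control = mean(normalize(r, o) for r, o in control_wells)
    return sample / control


# Hypothetical (RFU, OD600) readings, four colonies each.
untransformed = [(200, 1.0), (220, 1.1), (180, 0.9), (200, 1.0)]
mutant_a = [(4000, 1.0), (4400, 1.1), (3600, 0.9), (4000, 1.0)]

mutant_a_fold = fold_change(mutant_a, untransformed)
```

Dividing by OD600 first keeps a dense culture from looking like a strong promoter; the fold change then puts every mutant on the same scale.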

The results showed greater variance than expected. Across the small batch of 12 hits, the standard deviation of the mean normalized fluorescence across mutants was 55%. This large variance suggests that mutating the sequence context of established robust promoters might lead to the discovery of even more potent promoters. In future experiments, we will screen more mutants and characterize them against the established J23199 strong constitutive promoter.
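One way to read a standard deviation quoted as a percentage is as the spread of the per-mutant means relative to their average (a coefficient of variation). The sketch below computes that quantity; this interpretation and the example values are assumptions, not our measured data.

```python
from statistics import mean, stdev


def relative_spread(mutant_means):
    """Sample standard deviation as a percentage of the mean."""
    return 100 * stdev(mutant_means) / mean(mutant_means)


# Hypothetical per-mutant fold changes.
example_spread = relative_spread([1.0, 2.0, 3.0])
```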

Computational Biology


Protein-Protein Interactions (PPI)

To better understand the candidate generation step of our peptide inhibitor design pipeline, we examined the inhibitor candidates that Peptiderive, a program in the Rosetta software suite, output for 11 different protein-protein interactions (PPIs) downloaded from the Protein Data Bank. For the three PPIs involving complex proteins with multiple chains, we compared the binding energy scores of the best candidate from each partner-receptor chain pair. For the remaining eight PPIs, we analyzed the binding energy of all candidates assessed by the program for each partner-receptor chain pair.

Selected interface scores as generated from candidates for PPIs composed of multiple chains:


Overall, we noticed that in both of these analyses (one set of which is presented above), the candidate output by Peptiderive was not always a clear winner. This matters because the remainder of our pipeline currently revolves around optimizing the structure of the best candidate only, and all other inhibitor candidates are discarded by Peptiderive. Because Peptiderive does not take into account the conformational changes that peptides undergo during docking, we may be ignoring inhibitors that would outperform our current inhibitor once structurally optimized. In future updates to our pipeline, we will therefore explore ways to extract the other strong candidates from Peptiderive and incorporate them into the remainder of our model.
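A simple way to quantify "not a clear winner" is to look at the gap between the best and second-best scores and to list every candidate within some tolerance of the best. The sketch below works on a plain dictionary of scores rather than on Peptiderive output directly; the candidate names, scores, and tolerance are hypothetical, and it assumes that lower (more negative) interface scores indicate stronger predicted binding.

```python
def winner_margin(scores):
    """Gap between the best and second-best candidate scores.

    Assumes lower (more negative) scores mean stronger predicted binding.
    """
    ranked = sorted(scores.values())
    return ranked[1] - ranked[0]


def near_best(scores, tolerance):
    """Names of candidates whose score is within `tolerance` of the best."""
    best = min(scores.values())
    return sorted(name for name, s in scores.items() if s - best <= tolerance)


# Hypothetical interface scores for one partner-receptor chain pair.
scores = {"pep_12": -21.4, "pep_13": -20.9, "pep_40": -14.2}

margin = winner_margin(scores)
shortlist = near_best(scores, tolerance=1.0)
```

A small margin signals that the runner-up deserves to be carried forward; the shortlist is the set of candidates that a later structural optimization step could receive instead of a single peptide.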

CheRMiT (Chemical Reaction Mining via Transformers)

Across nine training team members, we produced several versions of chemical reaction validators and Snorkel labeling functions. For the cheminformatics models, we processed a selection of papers, extracted possible chemicals, reacted them combinatorially against a key of known reaction motifs, and looked for validated output. Similarly, we tested team members’ Snorkel labeling functions on a small set of ground truth sentences, collecting metrics and examining missed sentences for further analysis.

Notebook 1 (Cheminformatics):
Each model was benchmarked on this paper.
Model 1: Found 60 possible reactions // Phil’s
Model 2: Found 1 possible reactions // Guru’s
Model 3: Found 2 possible reactions // Tara’s
Model 4: Found 39 possible reactions // Cassie’s

The differences between the models can be explained by the limitations of the lookup mechanism. Because of overly restrictive or insufficiently restrictive text parsing, as well as the limitations of the PubChem API lookup, the models labeled different numbers of chemicals. The reactions found included some repeats, as well as several spurious results caused by the query API returning chemicals for words that are not chemicals. See below for a spurious returned chemical:


This issue would be fixed by feeding better-quality data into the pipeline. For a given dataset, about 31% of the words that returned a result from the PubChem database were not chemicals and only matched closely enough to return a hit. Therefore, a larger number of possible chemical reactions does not indicate a better model; a count closer to that of the more restrictive models is closer to what is expected for a single paper. It is worth noting, however, that each of the text parsers was implemented in a very rudimentary fashion, which caused many false matches. With much cleaner input that has a high likelihood of being a chemical, we expect accuracy to improve dramatically and the number of spurious reactions to drop to near zero.
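As one possible form of "cleaner input", the sketch below filters and deduplicates tokens before any lookup is made. It is not one of the team members' parsers: the `lookup` callable stands in for whatever chemical-name query is used (for example a PubChem request), and the word list, minimum length, and sentence are hypothetical.

```python
import re


def candidate_tokens(text, common_words, min_len=4):
    """Unique tokens worth sending to a chemical-name lookup."""
    seen, kept = set(), []
    for token in re.findall(r"[A-Za-z][A-Za-z0-9\-]*", text):
        key = token.lower()
        # Skip short tokens, everyday words, and repeats.
        if len(key) < min_len or key in common_words or key in seen:
            continue
        seen.add(key)
        kept.append(token)
    return kept


def find_chemicals(text, lookup, common_words):
    """Keep only tokens for which the injected `lookup` returns a hit."""
    return [t for t in candidate_tokens(text, common_words) if lookup(t)]


# Hypothetical stand-ins for a real word list and a real database query.
common_words = {"enzyme", "converts", "stored"}
known = {"glutamate", "proline"}
text = "The enzyme converts glutamate to proline and the proline is stored."

chemicals = find_chemicals(text, lambda t: t.lower() in known, common_words)
```

The minimum-length cutoff is a trade-off: it removes many spurious short matches but would also drop legitimate short names such as abbreviations, so a curated allow-list may be needed alongside it.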

Notebook 2 (Natural Language Processing):
Students made unique labeling functions:

Phil's Labeling Function Guru's Labeling Function

An optimal Snorkel model is made up of an ensemble of labeling functions, since each individual labeling function encodes a different heuristic. In addition, most labeling functions abstain on sentences that their heuristic does not cover. Therefore, analysis of an individual labeling function using typical classification metrics does not always tell us how useful the labeling function would be in practice.
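To make the abstain behavior concrete, here is a small pure-Python sketch that mirrors the common Snorkel convention of using -1 for an abstain vote. It does not use the Snorkel library, the two labeling functions and their keyword lists are hypothetical, and the majority vote is a simple stand-in for a learned label model.

```python
ABSTAIN, FALSE, TRUE = -1, 0, 1


def lf_reaction_verb(sentence):
    """True/Abstain: vote TRUE when a reaction verb stem appears."""
    stems = ("catalyz", "convert", "oxidiz")
    return TRUE if any(s in sentence.lower() for s in stems) else ABSTAIN


def lf_genetics_term(sentence):
    """False/Abstain: vote FALSE when a genetics term appears."""
    terms = ("allele", "genotype")
    return FALSE if any(t in sentence.lower() for t in terms) else ABSTAIN


def majority_vote(sentence, lfs):
    """Combine non-abstaining votes; abstain on ties or when nothing fires."""
    votes = [v for v in (lf(sentence) for lf in lfs) if v != ABSTAIN]
    n_true, n_false = votes.count(TRUE), votes.count(FALSE)
    if n_true == n_false:
        return ABSTAIN
    return TRUE if n_true > n_false else FALSE


lfs = [lf_reaction_verb, lf_genetics_term]
```

Each function covers only the sentences its heuristic applies to, which is why its stand-alone metrics can look weak even when it contributes useful votes to the ensemble.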

We tested the labeling functions on a dataset of 405 ground truth sentences. Of these sentences, 45 were labeled true (contains a chemical reaction) and 360 were labeled false (does not contain a reaction). This class imbalance helps us simulate the proportion of positive and negative sentences within a research paper.

False/Abstain LF:
To evaluate the False/Abstain labeling function, we calculated the LF’s false negative rate (FNR), false omission rate (FOR), true negative rate (TNR), and negative likelihood ratio (LR-).

As the LF abstained on sentences it did not label false, we did not calculate metrics corresponding to true positive and false positive data. We also note that individual LFs aren’t expected to cover the entire dataset, which means that metrics such as the TNR don’t necessarily reflect usefulness.

FNR: 0.2 (FN/P)
FOR: 0.0776 (FN/PN)
TNR: 0.297 (TN/N)
LR-: 0.673 (FNR/TNR)
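The four metrics follow from two counts and the class totals. In the sketch below, the counts FN = 9 and TN = 107 are back-calculated from the reported rates and the 45/360 split, so treat them as an assumption rather than as reported values.

```python
def false_abstain_metrics(fn, tn, positives, negatives):
    """Metrics for an LF that only ever votes false or abstains."""
    fnr = fn / positives   # positives wrongly labeled false
    tnr = tn / negatives   # negatives correctly labeled false
    return {
        "FNR": fnr,
        "FOR": fn / (fn + tn),  # share of 'false' votes that are wrong
        "TNR": tnr,
        "LR-": fnr / tnr,
    }


# Counts inferred from the reported rates (assumption).
m = false_abstain_metrics(fn=9, tn=107, positives=45, negatives=360)
```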

We also include a false negative sentence for in-depth analysis:
Interestingly, the allele of pro1 was shown to enhance the activities of gamma-glutamyl kinase and gamma-glutamyl phosphate reductase, both of which catalyze the first two steps of l-proline synthesis from l-glutamate and which together may form a complex in vivo.

The student’s LF failed on this sentence because it detected the word “allele”, which typically would not be found in a sentence containing a chemical reaction.

True/Abstain LFs:
To evaluate the True/Abstain labeling functions, we calculated the LFs’ true positive rate (TPR), false positive rate (FPR), false discovery rate (FDR), and positive likelihood ratio (LR+).

Guru:
TPR: 1.0 (TP/P)
FPR: 1.0 (FP/N)
FDR: 0.888 (FP/PP)
LR+: 1.0 (TPR/FPR)

Cassie:
TPR: 1.0 (TP/P)
FPR: 1.0 (FP/N)
FDR: 0.888 (FP/PP)
LR+: 1.0 (TPR/FPR)

Both of these LFs label every sentence in the test set as true. This is because the test set was preprocessed so that every sentence contains multiple chemical entities, as we wanted sentences to have an explicit substrate and product.
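The reported values are what any label-everything-true function produces on this test set. The sketch below recomputes them from the class totals given above (45 positive, 360 negative); the function name is hypothetical.

```python
def true_abstain_metrics(tp, fp, positives, negatives):
    """Metrics for an LF that only ever votes true or abstains."""
    tpr = tp / positives
    fpr = fp / negatives
    return {
        "TPR": tpr,
        "FPR": fpr,
        "FDR": fp / (tp + fp),  # share of 'true' votes that are wrong
        "LR+": tpr / fpr,
    }


# Every sentence labeled true: all 45 positives and all 360 negatives.
all_true = true_abstain_metrics(tp=45, fp=360, positives=45, negatives=360)
```

An LR+ of 1.0 means a "true" vote carries no information, and the FDR simply equals the fraction of negative sentences in the test set.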

True/False LF:
To evaluate the True/False LF, we calculated accuracy, precision, recall (TPR), and true negative rate (TNR).

Accuracy: 0.701 ((TP + TN) / (P + N))
Precision: 0.253 (TP/PP)
Recall (TPR): 0.867 (TP/P)
TNR: 0.681 (TN/N)

Because negative sentences greatly outnumber positive ones, accuracy depends more on labeling negative sentences correctly than on labeling positive sentences correctly. This can be confirmed by comparing the metrics: the accuracy (0.701) lies much closer to the TNR (0.681) than to the TPR (0.867).
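This relationship can be made explicit: accuracy is the average of TPR and TNR weighted by how many positive and negative sentences there are. The sketch below uses counts TP = 39 and TN = 245, which are back-calculated from the reported rates and are therefore an assumption.

```python
def accuracy_from_rates(tpr, tnr, positives, negatives):
    """Accuracy as the class-size-weighted average of TPR and TNR."""
    return (tpr * positives + tnr * negatives) / (positives + negatives)


# Rates rebuilt from inferred counts (assumption): TP = 39, TN = 245.
acc = accuracy_from_rates(tpr=39 / 45, tnr=245 / 360,
                          positives=45, negatives=360)

# Share of the weight carried by the negative class.
negative_weight = 360 / (45 + 360)
```

With roughly 89% of the weight on the negative class, a change in TNR moves accuracy about eight times as much as the same change in TPR.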

We include a false positive and false negative sentence below for in-depth analysis:

FN: In addition to the racemization it also catalyzes specific elimination of l-ser to pyruvate.

FP: Newly synthesized dopamine is stored in the terminals and then released, stimulating postsynaptic dopamine receptors and mediating the antiparkinsonian action of levodopa.

For the first sentence, the student’s LF failed to recognize the reaction verb “eliminate”. For the second sentence, the student’s LF mistook the word “synthesized” for a synthesis reaction. Therefore, we can add “eliminate” and remove “synthesize” from the LF to improve it.
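The proposed change can be sketched with a keyword-stem labeling function. The student’s actual keyword list is not reproduced here; the stems other than "eliminat" and "synthesiz" are hypothetical placeholders.

```python
TRUE, FALSE = 1, 0


def make_keyword_lf(stems):
    """Build a True/False LF that fires on any of the given word stems."""
    def lf(sentence):
        text = sentence.lower()
        return TRUE if any(stem in text for stem in stems) else FALSE
    return lf


# Hypothetical stem lists; only the swap of the first stem reflects the text.
original = make_keyword_lf({"synthesiz", "convert", "oxidiz"})
revised = make_keyword_lf({"eliminat", "convert", "oxidiz"})

fn_sentence = ("In addition to the racemization it also catalyzes specific "
               "elimination of l-ser to pyruvate.")
fp_sentence = ("Newly synthesized dopamine is stored in the terminals and "
               "then released, stimulating postsynaptic dopamine receptors "
               "and mediating the antiparkinsonian action of levodopa.")
```

Matching on stems lets "eliminat" catch both "eliminate" and "elimination". Removing "synthesiz" fixes this false positive but may cost true positives elsewhere, so the revised function should be re-scored on the full test set.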

Hallucinating Scaffolds Data

See the mutation heatmaps from our Hallucinating Scaffolds team below, as well as the relevant Jupyter notebook on our GitHub. You can also view the PSIPRED secondary structure predictions and IUPred disorder predictions.

Each residue of the amino acid sequence of phytoene desaturase (PDB ID: 4DGK) was independently mutated to alanine. IUPred disorder predictions and PSIPRED secondary structure predictions were generated for all mutants, and the change in disorder and in secondary structure (ss) is plotted below for all mutants. These changes are represented in two ways: traces and heat maps.

In the traces, the x-axis represents the residue position and the y-axis represents the new disorder/ss value minus the old disorder/ss value.

Heat maps sacrifice a bit of readability to compress the information in the traces. They also make it easier to observe correlations between alanine mutants.
x-axis: reference amino acid sequence
y-axis: (new disorder/ss value - old disorder/ss value) across the sequence, per mutant
color gradient: increasingly red for increasingly positive value changes, increasingly blue for increasingly negative value changes
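The data behind both plot types is a matrix of per-residue differences, one row per mutant. The sketch below shows how such a matrix could be assembled. The `predict` callable stands in for a per-residue predictor such as a disorder score and is purely a toy here, and skipping positions that are already alanine is an assumption of the sketch.

```python
def alanine_mutants(sequence):
    """Yield (position, mutant) with one residue replaced by alanine."""
    for i, residue in enumerate(sequence):
        if residue != "A":  # an A-to-A substitution changes nothing
            yield i, sequence[:i] + "A" + sequence[i + 1:]


def delta_matrix(sequence, predict):
    """Map each mutated position to its row of (new - old) values."""
    baseline = predict(sequence)
    return {
        pos: [new - old for new, old in zip(predict(mutant), baseline)]
        for pos, mutant in alanine_mutants(sequence)
    }


def toy_predict(seq):
    """Toy per-residue score (not a real disorder predictor)."""
    return [1.0 if aa in "PG" else 0.0 for aa in seq]


deltas = delta_matrix("MPGA", toy_predict)
```

Each row of the result corresponds to one trace, and stacking the rows in position order gives the heat map.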

Disorder: Both the traces and the heatmap show that the effect of the alanine mutants on disorder is very localized. Disorder prediction values increase or decrease slightly in a region of ~20 residues centered on the position of the mutation. Some of these 20-residue regions show particularly small changes in disorder. They correspond to areas of the protein that were already disordered (i.e., protein loops), so the alanine mutants had very little effect.

Secondary structure: Unlike what is observed for disorder, the traces and heatmaps of the alanine mutants show that changes in secondary structure are not localized. Across all alanine mutant positions, a few regions in the protein are particularly prone to secondary structure changes. Many of these regions are the substrate-binding regions identified by Schaub et al. (2012) (see PDB ID: 4DGK). Many of the alanine mutants appear to "break" helical regions (decreased H character and increased C character), but this is misleading since most of the substrate-binding regions are beta-strands. Oddly, the heatmaps show that some beta-strand regions have higher E character in the alanine mutants. A possible future experiment would investigate whether these regions correspond to highly conserved domains of the protein.
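The contrast between localized and non-localized effects can be put into a single number: the share of a mutant’s total absolute change that falls inside a window around the mutated position. The sketch below is one hypothetical way to compute it; the window of 10 residues on each side approximates the ~20-residue region described above, and the example row is made up.

```python
def local_fraction(delta_row, position, half_window=10):
    """Share of total absolute change within +/- half_window of the mutation."""
    total = sum(abs(d) for d in delta_row)
    if total == 0:
        return 0.0  # no change anywhere, so nothing to localize
    lo = max(0, position - half_window)
    hi = min(len(delta_row), position + half_window + 1)
    return sum(abs(d) for d in delta_row[lo:hi]) / total


# Hypothetical row of (new - old) values for a mutant at position 25.
row = [0.0] * 50
row[24], row[26], row[45] = 0.5, -0.25, 0.25

fraction = local_fraction(row, position=25)
```

Values near 1 indicate a localized effect, as seen for disorder; lower values indicate changes spread along the sequence, as seen for secondary structure.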