Sequencing results confirmed that the saturation mutagenesis was successful. We created a library of mutants with randomized sequences while leaving the -35 and -10 boxes (TTGACA and TATAAT) untouched, as shown below.
To characterize promoter strength, we retransformed the successfully cloned plasmids, plated the cells, and picked four colonies per plate for scale-up liquid culturing. Reporter signal was measured 20 hours post-culturing. GFP emission was measured at 525 nm after excitation at 483 nm. All data were normalized as the ratio of relative fluorescence units (RFU) to cell density measured at OD600. To show how much signal strength varies when saturation mutagenesis is applied only to the sequence context surrounding the consensus polymerase-binding boxes of the promoter, we then normalized all mutants against the white, untransformed Mach1 cells, expressing each signal as a fold change relative to this wild-type control.
The results showed greater variance than expected. Across the small batch of 12 hits, the standard deviation of the mean normalized fluorescence across mutants was 55%. This large variance suggests that mutating the sequence context of established, robust promoters might lead to the discovery of even stronger promoters. In future experiments, we will screen more mutants and characterize them against the established strong constitutive promoter J23119.
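The normalization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and all readings are hypothetical, and we assume the reported percentage is a relative standard deviation (sample standard deviation divided by the mean) of the per-mutant fold changes.

```python
from statistics import mean, stdev


def normalize(rfu, od600):
    """Fluorescence per unit cell density (RFU / OD600)."""
    return rfu / od600


def fold_change_vs_control(mutant_reads, control_reads):
    """Mean normalized signal of each mutant divided by the control mean.

    mutant_reads: dict mapping mutant name -> list of (rfu, od600) pairs,
                  one pair per picked colony.
    control_reads: list of (rfu, od600) pairs for the untransformed cells.
    """
    control = mean(normalize(r, o) for r, o in control_reads)
    return {
        name: mean(normalize(r, o) for r, o in reads) / control
        for name, reads in mutant_reads.items()
    }


def relative_sd(values):
    """Sample standard deviation as a fraction of the mean (assumed metric)."""
    return stdev(values) / mean(values)


# Hypothetical plate-reader values, not measured data.
example_control = [(100.0, 1.0), (100.0, 1.0)]
example_mutants = {
    "mutant_A": [(200.0, 1.0), (400.0, 2.0)],
    "mutant_B": [(600.0, 1.0)],
}
example_folds = fold_change_vs_control(example_mutants, example_control)
example_spread = relative_sd(list(example_folds.values()))
```

Dividing by OD600 before averaging keeps a dense culture from looking like a strong promoter; the fold change then puts every mutant on the same scale as the control.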
To better understand the candidate generation step of our peptide inhibitor design pipeline, we examined the inhibitor candidates that Peptiderive, a program in the Rosetta software suite, output for 11 different protein-protein interactions (PPIs) downloaded from the Protein Data Bank. For the three PPIs involving complex proteins with multiple chains, we compared the binding energy scores output for the best candidate from each partner-receptor chain pair. For the remaining eight PPIs, we analyzed the binding energy of all possible candidates assessed by the program for each partner-receptor chain pair.
Selected interface scores as generated from candidates for PPIs composed of multiple chains:
Overall, we noticed that in both of these analyses (one set of which is presented above), the candidate output by Peptiderive was not always a clear winner. This matters because the remainder of our pipeline currently revolves around optimizing the structure of the best candidate, and Peptiderive discards all other inhibitor candidates. Because Peptiderive does not take into account the conformational changes that peptides undergo during docking, we may be ignoring inhibitors that, once structurally optimized, would outperform our current inhibitor. Therefore, in future updates to our pipeline, we will explore ways to extract the other strong candidates from Peptiderive and incorporate them into the remainder of our model.
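One possible way to keep near-tied candidates instead of a single winner is sketched below. It is a hedged illustration, not part of our current pipeline: the function name, the 10% tolerance, and the example scores are all hypothetical, and we assume scores have already been collected as (peptide, interface score) pairs where a lower (more negative) score is better. It does not parse Peptiderive output.

```python
def shortlist(candidates, rel_tol=0.10, max_keep=5):
    """Return candidates whose score is within rel_tol of the best score.

    candidates: list of (peptide_id, interface_score) pairs; lower scores
                are assumed to be better.
    rel_tol:    hypothetical tolerance, as a fraction of the best score's
                magnitude.
    max_keep:   upper bound on how many candidates are passed downstream.
    """
    if not candidates:
        return []
    ranked = sorted(candidates, key=lambda c: c[1])
    best = ranked[0][1]
    cutoff = best + abs(best) * rel_tol
    return [c for c in ranked if c[1] <= cutoff][:max_keep]


# Hypothetical scores for one partner-receptor chain pair.
example_candidates = [("pep_C", -12.0), ("pep_A", -20.0), ("pep_B", -19.0)]
example_shortlist = shortlist(example_candidates)
```

Each shortlisted candidate could then go through the structural optimization step, so that a peptide that starts slightly behind still has a chance to overtake the initial winner.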
Across the 9 training team members, we produced several versions of chemical reaction validators and Snorkel labeling functions. For the cheminformatics models, we processed a selection of papers, extracted possible chemicals, reacted them combinatorially against a key of known reaction motifs, and looked for validated output. Similarly, we tested team members’ Snorkel labeling functions on a small set of ground-truth sentences, collecting metrics and examining missed sentences for further analysis.
Notebook 1 (Cheminformatics):
Each model was benchmarked on this paper.
Model 1: Found 60 possible reactions // Phil’s
Model 2: Found 1 possible reaction // Guru’s
Model 3: Found 2 possible reactions // Tara’s
Model 4: Found 39 possible reactions // Cassie’s
The differences between the models can be explained by the limitations of the lookup mechanism. Because the text parsing was either overly or insufficiently restrictive, and because of the limitations of the PubChem API lookup, the number of chemicals labeled differed between models. The reported reactions included some repeats, as well as several spurious results where the query API returned chemicals for words that are not chemicals. See below for a spurious returned chemical:
This issue would be fixed by feeding higher-quality data into the pipeline. For a given dataset, about 31% of the words that returned a result from the PubChem database were not chemicals and only matched closely enough to return a hit. Therefore, a higher count of possible chemical reactions does not indicate a better model. A count closer to that of the more restrictive models is closer to what is expected for a single paper.
However, it is worth noting that each of the text parsers was implemented in a very rudimentary fashion, which caused many false matches. With much cleaner input data that has a high likelihood of being a chemical, we expect the accuracy to improve dramatically and the number of spurious reactions to drop to near zero.
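The two problems noted above, repeated reactions and ordinary words treated as chemicals, can be illustrated with a small sketch. It is not any team member's model: the stop-word list, the function names, and the one-entry motif key are hypothetical, and no PubChem query is made. The idea is only to clean the candidate words before pairing them and to store results in a set so each reaction is counted once.

```python
from itertools import combinations

# Hypothetical list of ordinary words to drop before any database lookup.
STOPWORDS = {"set", "can", "aim"}


def clean_candidates(words, stopwords=STOPWORDS):
    """Lowercase, strip, drop stop-words, and remove duplicates (order kept)."""
    cleaned = []
    for word in words:
        word = word.strip().lower()
        if word and word not in stopwords and word not in cleaned:
            cleaned.append(word)
    return cleaned


def possible_reactions(chemicals, motif_key):
    """Pair chemicals combinatorially and keep pairs found in the motif key.

    motif_key: dict mapping frozenset({reactant_1, reactant_2}) -> product.
    Returns a sorted list of ((reactant_1, reactant_2), product) tuples,
    each reaction reported once.
    """
    found = set()
    for a, b in combinations(chemicals, 2):
        product = motif_key.get(frozenset((a, b)))
        if product is not None:
            found.add((tuple(sorted((a, b))), product))
    return sorted(found)


# Hypothetical one-entry key of known reactions (Fischer esterification).
example_key = {frozenset(("ethanol", "acetic acid")): "ethyl acetate"}
example_words = ["Ethanol", "set", "acetic acid", "ethanol", "water"]
example_reactions = possible_reactions(clean_candidates(example_words), example_key)
```

With this kind of pre-filtering, the reaction count reflects distinct reactant pairs rather than how often a word happens to appear or match a database entry.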
Notebook 2 (Natural Language Processing)
Students made unique labeling functions:
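As a generic illustration of the idea (not any student's actual function), a labeling function can be thought of as a small rule that votes on a sentence or abstains. The sketch below is written in plain Python and deliberately does not use Snorkel's own classes; the keyword lists, label values, and example sentences are hypothetical. It also shows how coverage, accuracy, and missed sentences could be collected against ground-truth labels.

```python
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def lf_reaction_verb(sentence):
    """Vote POSITIVE if the sentence contains a reaction-like verb."""
    verbs = ("converts", "catalyzes", "oxidizes", "reduces")
    text = sentence.lower()
    return POSITIVE if any(verb in text for verb in verbs) else ABSTAIN


def lf_figure_reference(sentence):
    """Vote NEGATIVE for sentences that only point at a figure."""
    return NEGATIVE if "figure" in sentence.lower() else ABSTAIN


def evaluate(labeling_function, labeled_sentences):
    """Score one labeling function on (sentence, true_label) pairs."""
    votes = [(labeling_function(s), y) for s, y in labeled_sentences]
    voted = [(v, y) for v, y in votes if v != ABSTAIN]
    coverage = len(voted) / len(votes) if votes else 0.0
    accuracy = sum(v == y for v, y in voted) / len(voted) if voted else None
    missed = [
        s for s, y in labeled_sentences
        if y == POSITIVE and labeling_function(s) != POSITIVE
    ]
    return {"coverage": coverage, "accuracy": accuracy, "missed": missed}


# Hypothetical ground-truth sentences.
example_sentences = [
    ("The enzyme converts A to B.", POSITIVE),
    ("See Figure 2.", NEGATIVE),
    ("A is oxidized to B.", POSITIVE),
]
example_report = evaluate(lf_reaction_verb, example_sentences)
```

The list of missed sentences is the useful part for follow-up analysis: here the passive form "oxidized" slips past a rule that only knows "oxidizes".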
See the mutation heatmaps from our Hallucinating Scaffolds team below, as well as the relevant Jupyter notebook on our GitHub. You can also view the PSIPRED secondary structure predictions and IUPred disorder predictions.
Each residue of the amino acid sequence of phytoene desaturase (PDB ID: 4DGK) was independently mutated to alanine. IUPred disorder predictions and PSIPRED secondary structure predictions were generated for all mutants. The changes in disorder and in secondary structure are plotted below for all mutants.
Changes in disorder and secondary structure are represented in two ways: traces and heat maps.
In the traces, the x-axis represents the residue position and the y-axis represents the new disorder/ss value minus the old disorder/ss value.
Heat maps sacrifice a bit of readability to compress the information in the traces. They also make it easier to observe correlations between alanine mutants.
x-axis: reference amino acid sequence
y-axis: (new disorder/ss value - old disorder/ss value) across the sequence, per mutant
color gradient: increasingly red for increasingly positive value changes, increasingly blue for increasingly negative value changes
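The quantity plotted in both views is a per-residue difference between mutant and reference predictions. A minimal sketch of that calculation is given below, assuming the predictions are already available as lists of per-residue numeric scores; the function names, the 0.05 threshold, and the example values are hypothetical and do not come from our notebook.

```python
def delta_trace(old, new):
    """Per-residue change: new value minus old value."""
    if len(old) != len(new):
        raise ValueError("reference and mutant profiles must have equal length")
    return [n - o for o, n in zip(old, new)]


def delta_heatmap(reference, mutants):
    """One row per mutant (ordered by mutated position), one column per residue.

    reference: list of per-residue scores for the unmutated sequence.
    mutants:   dict mapping mutated position -> list of per-residue scores.
    """
    return [delta_trace(reference, mutants[pos]) for pos in sorted(mutants)]


def changed_span(trace, threshold=0.05):
    """First and last residue index where |change| exceeds the threshold."""
    hits = [i for i, d in enumerate(trace) if abs(d) > threshold]
    return (hits[0], hits[-1]) if hits else None


# Hypothetical four-residue example, not real prediction output.
example_reference = [0.2, 0.2, 0.2, 0.2]
example_mutants = {
    1: [0.2, 0.5, 0.3, 0.2],
    0: [0.1, 0.2, 0.2, 0.2],
}
example_rows = delta_heatmap(example_reference, example_mutants)
example_spans = [changed_span(row) for row in example_rows]
```

Each row of the resulting matrix is one trace, so the heat map is simply all traces stacked; the span helper gives a rough measure of how far from the mutated position a change extends.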
Disorder:
Both the traces and the heat map show that the effect of the alanine mutations on disorder is very localized. Disorder prediction values increase or decrease slightly in a region of ~20 residues centered on the mutated position. Some of these 20-residue regions show particularly small changes in disorder. They correspond to areas of the protein that were already disordered (i.e., protein loops), so the alanine mutations had very little effect.
Secondary structure:
Unlike what is observed for disorder, the traces and heat maps of the alanine mutants show that changes in secondary structure are not localized. Across all alanine mutant positions, a few regions in the protein are particularly prone to secondary structure changes. Many of these regions are the substrate-binding regions identified by Schaub et al. (2012) (see PDB ID: 4DGK). Many of the alanine mutants appear to "break" helical regions (decreased H character and increased C character), but this is misleading since most of the substrate-binding regions are beta-strands. Oddly, the heat maps show that some beta-strand regions have higher E character in the alanine mutants. A possible future experiment would investigate whether these regions correspond to highly conserved domains of the protein.