Introduction

Wetlab


Modern approaches to engineering complex genetic circuitry increasingly necessitate the use of multi-gene cassettes. Such cassettes coordinate the expression of several genes simultaneously, facilitating the development of novel biosynthetic and regulatory systems in vivo. When designing these multi-operon circuits, diversifying the paired promoter-terminator sequences is crucial to prevent homologous recombination: repeated promoter sequences may recombine in an unexpected order, nullifying the cassette. The Anderson promoter set, characterized in the Registry of Standard Biological Parts [1], spans a 1000-fold range of expression from a small combinatorial library. These promoters were derived by diversifying the consensus sequence recognized by the σ70 subunit of RNA polymerase, allowing fine-tuned control of σ70 affinity. This promoter family, however, lacks sequence-diverse promoters of comparable strength that could be used together in multi-operon circuits. We propose an at-scale screening system for identifying promoter sequences of differing strengths: we first screen for high-expression promoters, then generate libraries from promising hits by performing saturation mutagenesis on the sequence context within and surrounding the σ70 consensus elements. This process can be repeated indefinitely to yield new promoters of desired strengths. Because randomizing this region produces surprising variance in the expression strength of the downstream ORF, our work poses new questions about the σ70 binding mechanism while also creating useful, novel promoters for multi-gene cassettes.
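To illustrate what one round of library generation could look like in silico, the sketch below samples promoter variants while holding the -35 (TTGACA) and -10 (TATAAT) hexamers fixed and randomizing the surrounding sequence context. The template (the consensus Anderson promoter J23119), the number of mutations per variant, and the library size are illustrative assumptions rather than finalized design parameters:

import random

# Illustrative template: the consensus Anderson promoter J23119 (verify the
# sequence against the Registry entry before use).
TEMPLATE = "ttgacagctagctcagtcctaggtataatgctagc".upper()

# Hold the -35 and -10 hexamers fixed; randomize spacer and flanking context.
I35 = TEMPLATE.find("TTGACA")
I10 = TEMPLATE.find("TATAAT")
FIXED = set(range(I35, I35 + 6)) | set(range(I10, I10 + 6))

def random_variant(template, fixed, n_mut, rng):
    """Return a variant with n_mut random substitutions outside the fixed hexamers."""
    positions = [i for i in range(len(template)) if i not in fixed]
    seq = list(template)
    for i in rng.sample(positions, n_mut):
        seq[i] = rng.choice([b for b in "ACGT" if b != seq[i]])
    return "".join(seq)

def build_library(size, n_mut=3, seed=0):
    """Sample a deduplicated library of promoter variants for screening."""
    rng = random.Random(seed)
    return sorted({random_variant(TEMPLATE, FIXED, n_mut, rng) for _ in range(size)})

if __name__ == "__main__":
    for variant in build_library(96):
        print(variant)

In practice the library would be ordered as degenerate oligos; this sketch only conveys the scale and structure of the randomization.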

[1] Anderson, J. Christopher. Anderson promoter collection. Registry of Standard Biological Parts, parts.igem.org/Promoters/Catalog/Anderson.
[2] Hossain, Ayaan, et al. "Automated design of thousands of nonrepetitive parts for engineering stable genetic systems." Nature Biotechnology, 2020. doi:10.1038/s41587-020-0584-2.
[3] Urtecho, Guillaume, et al. "Systematic Dissection of Sequence Elements Controlling σ70 Promoters Using a Genomically Encoded Multiplexed Reporter Assay in Escherichia coli." Biochemistry, 2019. doi:10.1021/acs.biochem.7b01069.

Computational Biology


Computational skills are often an afterthought when it comes to broadening students’ expertise in biology: the quantitative skills necessary for today’s domains of research are taught in only a few classes that many students may never take.

Given the wide-ranging suite of tools that each of our computational teams uses, our computational biology training workflow was customized for each team. Each team’s project leads created a custom Jupyter notebook aggregating important foundational literature, the tools relevant to the current stage of the project, and a mini-experiment in which students briefly demonstrate their mastery of concepts by applying them to generate novel data or code for the project. Through this workflow, all members gained a strong conceptual foundation for their project through collaborative onboarding onto these tools and exposure to real-world datasets and software that could be used downstream in their projects. These notebooks are now publicly accessible on our GitHub: https://github.com/igematberkeley/compbio-training.

Protein-Protein Interactions (PPI)

The PPI team is interested in identifying and designing potentially therapeutic peptide inhibitors that target and disrupt protein-protein interactions implicated in disease. They tackle this challenge by testing models of protein-protein interactions as well as methods for characterizing a peptide inhibitor’s possible interactions with its target protein. This semester, they are exploring both established approaches (Rosetta) and newer ones, and are interested in designing their own algorithms to explore protein structure and structure prediction.

For PPI’s training sprint, team members were introduced to background terminology and literature in the peptide-design field, as well as the team’s existing Rosetta pipeline for peptide design. Members executed the first step of the pipeline, PeptiDerive, identifying candidate inhibitors for five protein-protein interactions of their choice. In this custom Rosetta pipeline, PeptiDerive is first run to computationally derive peptides from a bound complex; traditionally, only the best candidate has been kept and all other, inferior candidates discarded. This sprint therefore featured an analysis of all PeptiDerive outputs generated by the members, to better characterize the variation in quality among the discarded candidates and to evaluate whether more of them merit further downstream analysis rather than only the top peptide. Such evaluations will help shape the team’s future research by providing a comprehensive benchmark of the current software.
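As a sketch of what that analysis could look like, the snippet below ranks the tabulated PeptiDerive candidates for each complex and keeps every peptide whose relative score falls within a margin of the best, instead of discarding everything but the top hit. The column names, example values, and margin are assumptions about how the team summarizes its outputs, not PeptiDerive’s native report format:

import pandas as pd

# Illustrative stand-in for the candidate table compiled from PeptiDerive reports;
# complex IDs, lengths, and scores are hypothetical placeholders.
df = pd.DataFrame({
    "complex":        ["1abc", "1abc", "1abc", "2xyz", "2xyz"],
    "peptide_length": [10, 12, 15, 10, 12],
    "relative_score": [62.0, 58.5, 40.1, 71.3, 70.9],  # % of interface energy recovered
})

MARGIN = 10.0  # keep peptides within 10 percentage points of the best (assumption)

def shortlist(candidates, margin=MARGIN):
    """Keep every candidate near the top-scoring peptide of its complex,
    rather than only the single best one."""
    kept = []
    for _, group in candidates.groupby("complex"):
        best = group["relative_score"].max()
        kept.append(group[group["relative_score"] >= best - margin])
    return pd.concat(kept).sort_values(
        ["complex", "relative_score"], ascending=[True, False]
    )

if __name__ == "__main__":
    print(shortlist(df))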

Chemical Reaction Mining via Transformers (CheRMiT)

CheRMiT focuses on integrating methods from language processing, cheminformatics, and machine learning to automate the mining and curation of reactions from the scientific literature. This information is particularly useful to biochemical researchers, but manual curation is inefficient and error-prone. The team is building a pipeline to mine and process chemical reactions from millions of papers, using chemistry-informed language models to extract reactions and cheminformatics libraries to validate them. Ultimately, a complete resource will not only provide curated information on thousands of enzymatic reactions for researchers to use, but may also assist in predicting undocumented reactions.

For the training sprint, CheRMiT team members completed two notebooks focusing on the two parallel prongs of the project: one on utilizing new cheminformatics libraries and the other introducing the concepts of language models in machine learning.

The first notebook introduces students to cheminformatic modeling and various machine-readable representations of compounds and reactions, then walks them through a small exercise in manually annotating a paper to explore the kinds of useful features it contains. The final do-it-yourself component of the notebook asks students to implement a reaction-operator validator using a sample paper and SMARTS reaction operators.
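A minimal version of such a validator, assuming RDKit as the cheminformatics library, applies a SMARTS reaction operator to the annotated reactants and checks that every annotated product appears among the generated products. The ester-hydrolysis operator and the methyl acetate example are illustrative, not drawn from the team’s corpus:

from rdkit import Chem
from rdkit.Chem import AllChem

def validate_operator(operator_smarts, reactant_smiles, expected_product_smiles):
    """Apply a SMARTS reaction operator to the annotated reactants and check
    whether every expected product is among the generated products."""
    rxn = AllChem.ReactionFromSmarts(operator_smarts)
    reactants = tuple(Chem.MolFromSmiles(s) for s in reactant_smiles)
    expected = {Chem.CanonSmiles(s) for s in expected_product_smiles}

    generated = set()
    for product_set in rxn.RunReactants(reactants):
        for mol in product_set:
            try:
                Chem.SanitizeMol(mol)
                generated.add(Chem.MolToSmiles(mol))
            except Exception:
                continue  # skip chemically invalid outputs of the operator
    return expected.issubset(generated)

if __name__ == "__main__":
    # Illustrative example: ester hydrolysis, methyl acetate -> acetic acid + methanol.
    op = "[C:1](=[O:2])[O:3][C:4]>>[C:1](=[O:2])[OH].[O:3][C:4]"
    print(validate_operator(op, ["COC(C)=O"], ["CC(=O)O", "CO"]))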

The second notebook walks students through a mixture of foundational ML topics and the NLP-specific tools the team has used. After briefly exploring topics and readings on gradient descent and PyTorch, students explore and implement utilities from the Snorkel and HuggingFace Transformers libraries. In particular, students first implement a custom Snorkel labeling function for semi-supervised annotation of training data, and then test an application of a new, relevant model called ChemRxnBERT on a sample paper.
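The Snorkel half of that exercise boils down to writing small heuristic functions that vote on whether a sentence describes a reaction. The sketch below shows the general pattern with two toy labeling functions; the label scheme, regular expressions, and example sentences are illustrative assumptions, and the ChemRxnBERT step (the team’s own model) is not reproduced here:

import re
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier

# Illustrative label scheme for sentence-level weak supervision.
REACTION, NOT_REACTION, ABSTAIN = 1, 0, -1

@labeling_function()
def lf_reaction_keywords(x):
    """Vote REACTION if the sentence contains verbs commonly used to report
    enzymatic conversions (heuristic, not exhaustive)."""
    if re.search(r"\b(catalyz|convert|hydrolyz|oxidiz|phosphorylat)\w*", x.text, re.I):
        return REACTION
    return ABSTAIN

@labeling_function()
def lf_methods_boilerplate(x):
    """Vote NOT_REACTION for sentences about storage or sample handling,
    which rarely report reactions."""
    if re.search(r"\b(stored at|centrifug\w*|resuspended|buffer)\b", x.text, re.I):
        return NOT_REACTION
    return ABSTAIN

if __name__ == "__main__":
    df = pd.DataFrame({"text": [
        "The enzyme catalyzes the hydrolysis of sucrose to glucose and fructose.",
        "Samples were stored at -80 C until further analysis.",
    ]})
    L = PandasLFApplier(lfs=[lf_reaction_keywords, lf_methods_boilerplate]).apply(df)
    print(L)  # label matrix: one column per labeling function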

Hallucinating Scaffolds

The Hallucinating Scaffolds team focuses on designing scaffolds for particular enzymatic active sites of interest to create proteins that support catalytic function. This de novo generation of enzymes with novel scaffolds will allow the team to design proteins with improved energetic stability and higher binding affinity while maintaining favorable active-site functionality. Currently, the team is focusing on integrating methods from RosettaDesign, trDesign (trRosetta-based design), and AlphaFold to design proteins that can eventually be tested in the wetlab.

For HSF's training sprint, team members were introduced to the general project description and to literature on protein design and the hallucination process, and explored two different de novo protein design pipelines: Tischer’s model and trDesign (trRosetta for protein design). In addition to providing onboarding instructions for using internal compute resources, the notebook provides a breakdown of Bash commands, the theoretical underpinnings of protein design models, and how to search for features and properties of proteins of interest.

Team members brainstormed ideas for potential quantitative analyses of hallucination outputs, then collectively engaged in a round of alanine scanning. Alanine scanning is an in silico mutagenesis technique useful for determining the contribution of specific residues to the stability and functionality of a given protein. Students therefore independently mutated each of the 50 assigned residues of the phytoene desaturase amino acid sequence (PDB ID: 4DGK), generated PSIPRED secondary structure predictions and IUPred disorder predictions, and used traces and heatmaps to analyze the changes caused by each mutation. Ultimately, students ran this experiment to gain exposure to the kinds of quantitative analysis necessary for characterizing essential domains, and to shed light on which residues should be preserved during protein hallucination to maintain functionality.
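The sequence-manipulation half of that exercise is simple to sketch: generate each single-alanine mutant and write it out as a FASTA file for downstream PSIPRED and IUPred runs. The demo sequence below is a short illustrative fragment, not the actual 4DGK phytoene desaturase chain, and the output layout is an assumption:

from pathlib import Path

def alanine_scan(sequence, positions):
    """Yield (name, mutant_sequence) pairs, mutating one residue at a time to Ala.
    Positions are 1-indexed, matching the assigned residue numbering."""
    for pos in positions:
        wt_res = sequence[pos - 1]
        if wt_res == "A":
            continue  # already alanine; nothing to scan
        mutant = sequence[:pos - 1] + "A" + sequence[pos:]
        yield f"{wt_res}{pos}A", mutant

def write_fastas(sequence, positions, out_dir="ala_scan"):
    """Write one FASTA per mutant for downstream PSIPRED / IUPred predictions."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for name, mutant in alanine_scan(sequence, positions):
        (out / f"{name}.fasta").write_text(f">{name}\n{mutant}\n")

if __name__ == "__main__":
    # Illustrative fragment only; replace with the real chain and the 50 assigned positions.
    demo_seq = "MSTKVAILGAGLAGLSAAYEL"
    write_fastas(demo_seq, positions=range(1, len(demo_seq) + 1))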