Engineering

Research

We began the project with an extensive literature search and by consulting researchers and professors in related fields. In these conversations, the professors frequently complained that manually screening for an enzyme is repetitive and time-consuming when no suitable enzyme for the target reaction is available in the databases. They usually do this work based on experience alone, or even by trial and error.

Starting from this problem, we did further research and found several existing software tools that aim to solve it. For example, XTMS applies retrosynthesis to search for metabolic pathways that produce a target compound, and PathPred has proved useful in multistep reaction prediction. Yet both have shortcomings. XTMS is based on a limited E. coli database and could not identify the precursor material, while with PathPred it is still inevitable that intermediate reactions will be disconnected. Naturally, it occurred to us that we might be able to make this research process easier.

Design

Preliminary design

Our preliminary design was as follows:

Frontend
- Users input a reaction, mark the participating substructure, and select the corresponding reaction type
Backend
- Receive the request from the frontend
- Perform a primary selection by screening for enzymes that catalyze reactions containing the substructure marked in the given reaction
- Pass these enzymes to the similarity-computing module
- Compute reaction similarity and sort the primarily selected enzymes by it
- Return the top 30 enzymes according to the sorted results
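
The backend steps above can be sketched in Python roughly as follows. This is a minimal illustration rather than MEI's actual code: the `Query` and `ReactionRecord` types, the string labels standing in for substructures and structural features, and the `toy_similarity` function are all assumptions made for the example. In MEI the similarity score comes from RxnSim.

```python
from typing import Callable, FrozenSet, List, NamedTuple, Tuple


class Query(NamedTuple):
    """Hypothetical user request: reaction type, marked substructure, features."""
    reaction_type: str
    marked_substructure: str
    features: FrozenSet[str]


class ReactionRecord(NamedTuple):
    """Hypothetical database row: a known reaction and the enzyme catalyzing it."""
    ec_number: str
    reaction_type: str
    substructures: FrozenSet[str]
    features: FrozenSet[str]


def toy_similarity(query: Query, record: ReactionRecord) -> float:
    """Stand-in score: fraction of the query's features found in the record."""
    if not query.features:
        return 0.0
    return len(query.features & record.features) / len(query.features)


def rank_enzymes(
    query: Query,
    records: List[ReactionRecord],
    similarity: Callable[[Query, ReactionRecord], float] = toy_similarity,
    top_n: int = 30,
) -> List[Tuple[str, float]]:
    # Primary selection: same reaction type and containing the marked substructure.
    candidates = [
        r for r in records
        if r.reaction_type == query.reaction_type
        and query.marked_substructure in r.substructures
    ]
    # Similarity computation, then sorting from most to least similar.
    scored = [(r.ec_number, similarity(query, r)) for r in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Return only the best top_n enzymes.
    return scored[:top_n]


records = [
    ReactionRecord("1.1.1.1", "oxidoreductase", frozenset({"C-OH"}), frozenset({"a", "b", "c"})),
    ReactionRecord("1.1.1.2", "oxidoreductase", frozenset({"C-OH"}), frozenset({"a", "x"})),
    ReactionRecord("2.7.1.1", "transferase", frozenset({"C-OH"}), frozenset({"a", "b", "c"})),
    ReactionRecord("1.2.1.3", "oxidoreductase", frozenset({"C=O"}), frozenset({"a", "b", "c"})),
]
query = Query("oxidoreductase", "C-OH", frozenset({"a", "b"}))
ranking = rank_enzymes(query, records)
print(ranking)  # [('1.1.1.1', 1.0), ('1.1.1.2', 0.5)]
```

Only the first two records pass the primary selection, so the more expensive similarity step runs on a much smaller set than the full database.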

Here is a flowchart of our preliminary backend design.

Advantages

Learning from existing similar software such as PathPred and Ketcher, we noticed the importance of user-friendliness, so we designed our web software with several perspectives in mind.

From the users' point of view, we optimized the way the software is used. Instead of requiring command-line access, our web software provides a graphical drawing board. Building on Ketcher, we designed a customizable interface that lets users work according to their own preferences. Users can draw chemical structures directly or import a .mol file to generate them.

From a feedback perspective, our software returns a complete set of information about each enzyme in an intuitive, understandable form. It also returns links to the Brenda database entries of the predicted enzymes, so that users can explore further.

From the implementation point of view, our software supports fuzzy search to tolerate errors in user input. In addition, to speed up enzyme searching, we adopted a heuristic search algorithm that balances accuracy and speed.

It is worth mentioning that our software has no environment dependencies. It runs in the web browser, so users do not need to install sophisticated software on their local Windows, Linux or macOS computer. The web software is accessible to anyone with Internet access.

Build

Data Collecting

We collected enzyme, cofactor and kinetic parameter data from the public database Brenda. To facilitate use and querying, we extracted the desired information from the collected data, rebuilt it into a new database, and integrated it into our web software.

Database Building

We maintain three main kinds of databases: reaction DBs, enzyme DBs and kinetic info DBs. Reaction DBs contain information about reactions and are mostly used in the prescreening process to improve MEI's performance. Enzyme DBs hold the basic information for each enzyme number, such as organisms and enzyme name. Kinetic info DBs hold the pH, Km, kcat and temperature data of each enzyme. We obtained these data from open-access databases such as Brenda and ExplorEnz. We made many optimizations when building our self-constructed database. We store reactions and molecules in SMILES format instead of the names or chemical formulas used in the source databases, which increases the representational power of the stored text. To support fuzzy search, we introduced the SOUNDEX algorithm and saved an extra soundex column for each column that represents a name, so that MEI is robust to typos in a user's query.
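
To illustrate why a soundex column makes name lookups tolerant of typos, here is a small Python sketch of the classic American Soundex encoding. It is an assumption that this variant matches what MEI uses (database engines ship their own variants), and the organism names and the `fuzzy_lookup` helper are hypothetical.

```python
from collections import defaultdict

# Consonant groups of the classic Soundex algorithm.
_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        _CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code of a name ('' if it has no letters)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = [letters[0]]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        digit = _CODES.get(ch, "")  # vowels and Y map to "" and reset prev
        if digit and digit != prev:
            code.append(digit)
            if len(code) == 4:
                break
        prev = digit
    return "".join(code).ljust(4, "0")


# Precompute the extra "soundex column" once, as described above.
organisms = ["Escherichia coli", "Bacillus subtilis", "Saccharomyces cerevisiae"]
by_code = defaultdict(list)
for organism in organisms:
    by_code[soundex(organism)].append(organism)


def fuzzy_lookup(user_input: str) -> list:
    """Return stored names whose Soundex code equals that of the input."""
    return by_code.get(soundex(user_input), [])


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(fuzzy_lookup("Eschericia coli"))       # ['Escherichia coli']
```

Because only the first four code characters are kept, Soundex is coarse: it tolerates misspellings late in a word but can also group unrelated names, so it works best as a first filter.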

Model Selection

Enzyme Kinetics

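In biochemistry, Michaelis–Menten kinetics is one of the best-known models of enzyme kinetics. The model takes the form of an equation describing the rate of enzymatic reactions by relating the reaction rate $v$ to the substrate concentration $[S]$:

$$v = \frac{V_{\max}\,[S]}{K_m + [S]}$$

Here $V_{\max}$ is the maximum rate reached at saturating substrate concentration, and the Michaelis constant $K_m$ is the substrate concentration at which the rate is half of $V_{\max}$.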

This equation is called the Michaelis–Menten equation, a mathematical model of the reaction. The model is used in a variety of biochemical situations other than enzyme-substrate interaction, including antigen–antibody binding, DNA–DNA hybridization, and protein–protein interaction. It can be used to characterise a generic biochemical reaction, in the same way that the Langmuir equation can be used to model generic adsorption of biomolecular species.
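
As a small worked example, the Michaelis–Menten rate law, v = Vmax·[S] / (Km + [S]), can be evaluated in Python as below. The function name and the numeric values are illustrative assumptions, not parameters taken from MEI's database.

```python
def michaelis_menten_rate(substrate_conc: float, v_max: float, k_m: float) -> float:
    """Reaction rate v = v_max * [S] / (k_m + [S]).

    substrate_conc and k_m must share one concentration unit;
    the result has the same unit as v_max.
    """
    if substrate_conc < 0 or v_max < 0 or k_m <= 0:
        raise ValueError("concentration and v_max must be >= 0, k_m must be > 0")
    return v_max * substrate_conc / (k_m + substrate_conc)


# When [S] equals Km, the rate is exactly half of Vmax.
print(michaelis_menten_rate(substrate_conc=2.0, v_max=10.0, k_m=2.0))  # 5.0
```

At substrate concentrations far above Km the rate approaches Vmax, which is why Km and kcat are useful when judging how much enzyme and substrate to use.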

RxnSim

RxnSim provides methods to compute chemical similarity between two or more reactions and molecules. Molecular similarity is computed based on structural features. Reaction similarity is a function of similarities of participating molecules. The package provides multiple methods to extract structural features as fingerprints (or feature vectors) and similarity metrics. It additionally provides functionality to mask chemical substructures for weighted similarity computations. It uses rCDK and fingerprint packages for cheminformatics functionality.
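
The idea that reaction similarity is a function of the similarities of the participating molecules can be sketched in Python as follows. This is a simplified illustration and not RxnSim's actual algorithm or API (RxnSim is an R package): fingerprints are modelled as plain sets of integer feature IDs, and the best-match averaging in `side_similarity` is an assumption chosen for clarity.

```python
from typing import FrozenSet, List, NamedTuple

Fingerprint = FrozenSet[int]  # toy fingerprint: a set of feature IDs


class Reaction(NamedTuple):
    reactants: List[Fingerprint]
    products: List[Fingerprint]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient: shared features divided by all features."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def side_similarity(side_a: List[Fingerprint], side_b: List[Fingerprint]) -> float:
    """Average, over both sides, of each molecule's best match on the other side."""
    if not side_a or not side_b:
        return 0.0
    best_a = [max(tanimoto(m, n) for n in side_b) for m in side_a]
    best_b = [max(tanimoto(m, n) for m in side_a) for n in side_b]
    return (sum(best_a) + sum(best_b)) / (len(best_a) + len(best_b))


def reaction_similarity(rxn_a: Reaction, rxn_b: Reaction) -> float:
    """Mean of reactant-side and product-side similarity."""
    return (side_similarity(rxn_a.reactants, rxn_b.reactants)
            + side_similarity(rxn_a.products, rxn_b.products)) / 2


rxn_a = Reaction([frozenset({1, 2, 3})], [frozenset({4, 5})])
rxn_b = Reaction([frozenset({1, 2, 3})], [frozenset({4, 6})])
print(reaction_similarity(rxn_a, rxn_a))  # 1.0
print(round(reaction_similarity(rxn_a, rxn_b), 3))  # 0.667
```

In the second comparison the reactants are identical (similarity 1) while the products share one of three features (similarity 1/3), so the reaction similarity is their mean, 2/3.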

Technical Support

Back-end

The back-end web framework we used is Django, which makes it easier to build better web apps more quickly and with less code. REST APIs matter because they give us an easy way to interact with the database, so we use Django REST framework, a powerful and flexible toolkit for building Web APIs, to achieve a RESTful style. We use MySQL for its performance, high availability, scalability, platform friendliness and friendly interface. Ray is a general-purpose framework for programming a cluster. Ray enables developers to easily parallelize their Python applications or build new ones, and run them at any scale, from a laptop to a large cluster. Ray provides a highly flexible, yet minimalist and easy-to-use API. We chose Ray for its excellent performance in parallel computing.

Front-end

Ketcher version 1.0 was released under GNU Affero General Public License v3.0

To help users enter reaction information successfully and read the returned information intuitively, we integrated Ketcher version 1.0 (released under GNU Affero General Public License v3.0) into the product interface. It lets users input chemical structures as freely and easily as drawing, or use the built-in simple chemical structures to add groups with one click. It can also directly import files in SMILES or MDL Molfile format, giving users a variety of input options.

At the same time, we use Vue to clearly display the information returned by the back-end in two formats: card and list. The list format is compact and presents the information to users as text in one place. The card format uses the RDKit plug-in to draw the chemical structures of the reactants and products of the most similar reaction directly in SVG format, displaying the results more vividly.

Service Deployment

Nginx is an open source reverse proxy server for HTTP, HTTPS, SMTP, POP3, and IMAP protocols, as well as a load balancer, HTTP cache, and a web server (origin server). The nginx project started with a strong focus on high concurrency, high performance and low memory usage. It is licensed under the 2-clause BSD-like license and it runs on Linux, BSD variants, Mac OS X, Solaris, AIX, HP-UX, as well as on other *nix flavors. It also has a proof of concept port for Microsoft Windows.

We deploy our frontend and backend with Nginx and uWSGI. They enable us to use multiple threads and a message queue, which improves the concurrency and stability of MEI. To connect the web server with the calculation server, we use RPC so that the backend API can communicate with the Ray cluster on the other server.

Test

Dry Lab Validation

As we continued to refine our project, we did extensive dry lab validation to test our model and its performance. Our dry lab validation is based on the dataset of the 22nd (2003) edition of the IUBMB-Nicholson Metabolic Pathways Chart, which contains updated pathways involved in ATP metabolism in the mitochondria and chloroplast. We selected multiple different types of reactions from this chart to validate our software model, and the results were as desired, confirming the functional reliability of our software.

Learn

Iterations

We invited our instructor to test our software, and he advised us to take cofactors, organisms and kinetics into consideration. He also reminded us of auxiliary reactions that recycle cofactors, which may help reduce the reaction cost.

Owing to these valuable suggestions, we made the following changes:

Options expanded for users
Cofactor and organism options were added to the frontend interface and taken into account during the primary enzyme selection in the backend for more granular searches. In addition, we implemented fuzzy search of the organism option with soundex.
More information
To better reflect real conditions, the basic return now contains extra information about the optimum temperature and pH of the selected enzymes.
Further return of the auxiliary reaction
Apart from the basic return, we added a further return, which gives the relevant kinetic constants to inform researchers of the suggested ratio of enzyme to substrate for higher efficiency.

Improvement of Build

Building a user-friendly interface
At the suggestion of our consultant Cao Yaozhong, we modified the chemical structure input sketchpad, built in more chemical structures commonly used in synthetic biology, and added buttons that let users input more commonly used groups with one click.
While testing our product, the USTC team that cooperated with us found that users have their own commonly used chemical structures or groups, depending on their field, which often need to be input repeatedly. We therefore added features that can be customized to the needs of each user.
We use the browser's localStorage property to let users store the currently input chemical structure or group locally under custom keys. Several can be stored; these entries take up very little space and have no adverse effect on the user's device. Users can then enter these common structures with one click, customizing their own common groups according to their work and research needs.
Database Optimization
MEI computes the similarity between the requested reaction and each reaction in the database. Our reaction database is quite large, which resulted in slow responses. To solve this problem, we introduced a prescreening module that divides the database into several sub-databases and routes each query to one of them according to the reaction type. This significantly reduced the time cost, but there was more to optimize. To support fuzzy searching, we introduced the soundex algorithm, which added a time cost on the order of O(n). Our users reported a delay of more than 10 seconds after this feature was added. To fix this, we pre-processed each record in the database, trading more disk space for the time otherwise spent calculating the soundex output of existing records. Furthermore, we created a B+ tree index on the preprocessed column, which reduced the lookup time complexity to O(log n) and the delay to less than 3 seconds, greatly improving the user experience.
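
The precomputed-column-plus-index idea can be tried out with the following Python sketch. It uses the standard library's SQLite (whose indexes are B-trees) as a stand-in for MySQL, and the table name, column names and index name are hypothetical. The soundex codes are inserted as already-computed values, mirroring the preprocessing step described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE organism ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " name_soundex TEXT NOT NULL)"  # extra column filled once, at load time
)
# Codes are assumed to have been computed offline for every existing record.
conn.executemany(
    "INSERT INTO organism (name, name_soundex) VALUES (?, ?)",
    [
        ("Escherichia coli", "E262"),
        ("Bacillus subtilis", "B242"),
        ("Saccharomyces cerevisiae", "S265"),
    ],
)
# The index lets the engine seek to matching codes instead of scanning all rows.
conn.execute("CREATE INDEX idx_organism_soundex ON organism (name_soundex)")


def find_by_code(code: str) -> list:
    """Return organism names whose precomputed soundex code equals `code`."""
    cursor = conn.execute(
        "SELECT name FROM organism WHERE name_soundex = ? ORDER BY name", (code,)
    )
    return [row[0] for row in cursor]


# Only the user's input is encoded at request time; here its code is "E262".
print(find_by_code("E262"))  # ['Escherichia coli']

# Ask the query planner whether the lookup goes through the index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM organism WHERE name_soundex = ?",
    ("E262",),
).fetchall()
plan_text = " ".join(str(col) for row in plan for col in row)
print(plan_text)
```

The printed query plan should mention `idx_organism_soundex`, which indicates an index search rather than a full table scan.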
Performance Improvement
Because of the large scale of the primarily selected data, our preliminary software took a long time to compute similarity. Therefore, we first separated the web module from the calculation module and deployed them on different servers. We then modified the parameters passed to RxnSim and enabled caching, which improved the speed by 30%. We also found that single-process execution was inefficient and could hardly utilize all the resources of the servers. Hence, we used a distributed-computing framework, Ray, to further improve our speed, and the time needed for the same inputs was reduced by 50%. In pursuit of better performance, we upgraded to a high-performance server, which improved our speed tremendously. After the optimizations in all four aspects, the time needed to compare 5000 reactions was reduced from 90 s to only 10 s.
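
The split-score-merge pattern behind this speed-up can be sketched with the Python standard library alone. This is not MEI's code: a thread pool stands in for Ray's remote tasks so that the example has no external dependencies, `difflib.SequenceMatcher` on SMILES strings stands in for the RxnSim similarity, and all function names and parameters are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from functools import lru_cache
from typing import List, Tuple


@lru_cache(maxsize=4096)
def similarity(query: str, candidate: str) -> float:
    """Stand-in score; cached so that repeated pairs are not recomputed."""
    return SequenceMatcher(None, query, candidate).ratio()


def score_chunk(query: str, chunk: List[str]) -> List[Tuple[float, str]]:
    """Score one chunk of candidates; with Ray this would be one remote task."""
    return [(similarity(query, c), c) for c in chunk]


def rank_parallel(query: str, candidates: List[str], workers: int = 4,
                  chunk_size: int = 2, top_n: int = 3) -> List[Tuple[float, str]]:
    # Split the candidate list into independent chunks.
    chunks = [candidates[i:i + chunk_size]
              for i in range(0, len(candidates), chunk_size)]
    # Score the chunks concurrently, then flatten the partial results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda chunk: score_chunk(query, chunk), chunks)
        scored = [pair for part in partials for pair in part]
    # Merge: keep only the top_n highest-scoring candidates.
    return heapq.nlargest(top_n, scored)


candidates = ["CCO", "CC=O", "CC(=O)O", "CCCO", "OCC(O)CO"]
top = rank_parallel("CCO", candidates)
print(top[0])  # (1.0, 'CCO')
```

Threads in CPython do not speed up CPU-bound scoring, so the sketch shows only the structure. The gain reported above comes from running such chunks as separate processes across a cluster, which is what Ray provides.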
Message Queue
Our preliminary software was quite weak and supported only one online client at a time. To improve concurrency, we used uWSGI and modified the execution logic of Ray so that we could support more clients visiting our site simultaneously.
Framework Optimization
With all the optimizations and advice above, and after numerous iterations, our framework changed greatly. Our preliminary framework was as follows; it was simple and weak.
After the addition of the message queue, Ray clusters and RPC, here is our final framework, with higher performance and more stability.

Reference

1. Varun Giri, Tadi Venkata Sivakumar, Kwang Myung Cho, Tae Yong Kim, Anirban Bhaduri, RxnSim: a tool to compare biochemical reactions, Bioinformatics, Volume 31, Issue 22, 15 November 2015, Pages 3712–3714, https://doi.org/10.1093/bioinformatics/btv416

2. Philipp Moritz et al., Ray: A Distributed Framework for Emerging AI Applications, arXiv:1712.05889 [cs.DC], https://doi.org/10.48550/arXiv.1712.05889

3. Robert Nishihara, Modern Parallel and Distributed Python: A Quick Tutorial on Ray, Towards Data Science, https://towardsdatascience.com/modern-parallel-and-distributed-python-a-quick-tutorial-on-ray-99f8d70369b8

4. What Is MySQL And Why It Is Used?, softwaretestinghelp.com, https://www.softwaretestinghelp.com/what-is-mysql/

5. Lehninger Principles of Biochemistry, W.H. Freeman.

6. https://en.wikipedia.org/wiki/Michaelis%E2%80%93Menten_kinetics

7. https://en.wikipedia.org/wiki/Nginx