Disclaimer: we did not create this software; we improved it. Credit for designing the tool goes to Vilnius-Lithuania-iGEM 2021, mainly to Ieva Pudžiuvelytė. Our contribution was to rewrite the source code in another programming language and to add functionality.

Introduction

The dry lab team of Vilnius-Lithuania iGEM 2022 decided to contribute to the "NanoFind" project in two ways: by creating a responsive, design-driven wiki page and by improving an existing software tool created by the Vilnius-Lithuania iGEM 2021 team.

GenFusMSA

The script that our team decided to improve is called GenFusMSA and was originally developed by the Vilnius-Lithuania iGEM 2021 team. It was created because modeling a fusion protein system required supplying the modeling programs with an additional input, a multiple sequence alignment (MSA). Popular tools such as trRosetta and RoseTTAFold could not accurately handle the fusion protein cases that the team needed, so the Vilnius-Lithuania iGEM 2021 team decided to generate the MSA input files themselves by using sequence-search methods.

Motivation

Our team decided to improve this software after reaching out to the Vilnius-Lithuania iGEM 2021 team, mainly to the author of the software, Ieva Pudžiuvelytė. We received a suggestion to update GenFusMSA by rewriting its source code in Python, which would make the tool more accessible and user-friendly.

Description

The purpose of GenFusMSA has not changed: it generates multiple sequence alignment files that can be used for fusion protein modeling with programs like trRosetta, RoseTTAFold, and AlphaFold2. Providing your own MSA files makes it possible to obtain more probable structures in which the domains of the linked proteins are less disordered.

The script is a simple program that scans two full query-template .a3m input files and pairs their sequences by taxonomy ID. Each pair of sequences is joined by a peptide linker chosen by the user, and the resulting collection of fused sequences can be saved as an output .a3m file.
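
The pairing step described above can be illustrated with a minimal Python sketch. It is not the GenFusMSA source code. It assumes that each hit header carries its taxonomy ID as `TaxID=<digits>` or `OX=<digits>`, that only the first hit per taxonomy ID is kept, and that records without a taxonomy ID (such as the query) are skipped; the function names are hypothetical.

```python
import re
from typing import Dict, Iterable, Iterator, List, Tuple

# Assumption: the taxonomy ID appears in the header as "TaxID=<digits>"
# or "OX=<digits>". The real script may read it differently.
TAX_ID_PATTERN = re.compile(r"\b(?:TaxID|OX)=(\d+)")


def parse_a3m(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from A3M/FASTA-style text."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def first_hit_by_tax_id(records: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Keep the first sequence seen for each taxonomy ID."""
    hits: Dict[str, str] = {}
    for header, sequence in records:
        match = TAX_ID_PATTERN.search(header)
        if match:
            hits.setdefault(match.group(1), sequence)
    return hits


def fuse_by_taxonomy(a3m_1: str, a3m_2: str, linker: str) -> List[Tuple[str, str]]:
    """Join sequences that share a taxonomy ID with the given linker."""
    first = first_hit_by_tax_id(parse_a3m(a3m_1))
    second = first_hit_by_tax_id(parse_a3m(a3m_2))
    return [
        (tax_id, first[tax_id] + linker + second[tax_id])
        for tax_id in first
        if tax_id in second
    ]
```

In this sketch, a taxonomy ID that occurs in only one of the two inputs produces no fused sequence.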

Benefits of improving the software

We believe that rewriting the program from Perl to Python has achieved the following:

Increased accessibility of the program to more researchers, thanks to the greater prevalence of Python.
Meeting researchers' demand for more precise and probable protein structures.
A more scalable and reusable tool that can be easily integrated with other modern synthetic biology systems and solutions.
Increased functionality: the results of the calculations can now be checked in several output files (.a3m, .txt, and .csv), whereas previously they were available only in the system console or in an .a3m file.

Taken together, we believe these changes improve both the functionality and the user experience of the program.

Usage

Before running the program, the user has to install a Python interpreter and download the GenFusMSA.py script. Two .a3m files, which can be generated with the HHblits program, are required as input (input1 and input2 parameters). The user also sets the linker sequence (linker parameter), the number of times it is repeated (linkerRepeats parameter), and whether it is prolonged (isProlonged parameter):

python GenFusMSA.py input1 input2 linker linkerRepeats isProlonged

Example:

python GenFusMSA.py input1.a3m input2.a3m EAAAK 1 true
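
For illustration, the five positional arguments shown above could be read in Python as in the following sketch. It is not the actual GenFusMSA.py code. The helper names are hypothetical, and the sketch assumes that isProlonged is passed as the text true or false and that repeating the linker means concatenating it linkerRepeats times. The effect of isProlonged is not modeled here.

```python
import argparse


def parse_bool(value: str) -> bool:
    """Convert the text 'true' or 'false' (any case) to a bool."""
    lowered = value.lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise argparse.ArgumentTypeError(f"expected true or false, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Declare the positional arguments in the order used on the command line."""
    parser = argparse.ArgumentParser(prog="GenFusMSA.py")
    parser.add_argument("input1", help="first .a3m input file")
    parser.add_argument("input2", help="second .a3m input file")
    parser.add_argument("linker", help="linker sequence, e.g. EAAAK")
    parser.add_argument("linkerRepeats", type=int, help="number of linker repeats")
    parser.add_argument("isProlonged", type=parse_bool, help="true or false")
    return parser


def repeated_linker(linker: str, repeats: int) -> str:
    """Assumption: the linker is repeated by simple concatenation."""
    return linker * repeats


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(repeated_linker(args.linker, args.linkerRepeats))
```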

By default, the newly generated .a3m file is saved, together with the .txt and .csv files, in the folder where the script is located. Detailed usage instructions and documentation of the software can be found in the GitLab repository.