Recursive name, critical acclaim
Teams competing in iGEM go to great lengths to produce high-quality projects, and for many teams this involves creating modelling sub-projects and software tools to support their research. Though many teams document their work on their respective wikis, the sheer volume of information makes it infeasible to manually search and parse through years of wiki content for information relevant to a new project. To help address this, we developed SAMARA, the Software And Modelling Aggregating Research Assistant. SAMARA serves as a replacement for the 2018 Calgary team’s SARA subproject, incorporating many of its best features while avoiding its pitfalls and flaws. In doing so, we hope to create an easily expandable, automated, well-documented platform for teams looking to conduct research, as well as a framework from which future teams can build their own software tools.
Though SAMARA takes heavy inspiration from SARA, and though the two “sisters” were designed to solve similar problems, SAMARA was built from the ground up to be as robust and versatile as possible, addressing many of SARA’s shortcomings. Before software development could begin, however, we needed a set of objectives to structure the project around.
Scalability and flexibility were important goals for SAMARA. Unlike SARA, which featured many hard-coded systems that made future changes difficult, SAMARA was designed with modularity in mind: we wanted to encourage modification and alteration, allowing each team to change parameters as necessary for their project. This also allows us (or future teams) to easily scale the system’s reach and extend its mandate to collect and summarize even more information.
One of the key issues with SARA was overall inefficiency. For instance, many of the libraries used in the original program were old and slow, and have since been made redundant by superior alternatives. These inefficiencies, however, paled in comparison to SARA’s largest one: time. The decision to summarize pages manually created a huge bottleneck while all but guaranteeing that subsequent teams would never maintain the framework; simply put, iGEM teams don’t have the time to keep the system up to date year after year. Thus, one of SAMARA’s main goals was near-full autonomy: once started, the system should crawl and summarize the relevant pages on its own, without human intervention.
Another of SARA’s flaws was its lackluster documentation. Though some information about the project was available, many of its inner workings and design decisions were obscure and difficult to understand. The Python files lacked comments, and half-implemented features remained in the repository, making the source code a convoluted mess to follow and leaving much of it near unusable. To ensure this wouldn’t happen again, we aimed for meticulously commented, human-readable files for SAMARA.
SAMARA was developed in Python due to its versatility and its extensive ecosystem of well-made, well-documented libraries. SAMARA’s implementation was two-part: the web scraper iGEMScraper.py and the Django-based front-end deployment.
iGEMScraper is a web-crawling, scraping, and exporting program built on the powerful Scrapy library, which allows for a simple yet robust implementation and a substantially faster scraping process than SARA’s requests/BeautifulSoup-based approach. The program is driven by a Scrapy “CrawlSpider,” which, given a start URL, automatically follows every link on that page, then every link on each page it reaches, and so on. By restricting the allowed domains (to avoid scraping the entire internet) and the allowed URL patterns (a few URLs per team wiki suffice, since navbars link out to the rest), we can produce a thorough search of all pages without much effort on our part. The CrawlSpider then filters URLs based on certain keywords and parameters and passes them, along with the scraped information, to the processing stage.
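A minimal sketch of a CrawlSpider of this kind is shown below; the class name, domain, URL patterns, and CSS selectors are illustrative assumptions rather than iGEMScraper’s actual values.

```python
# Hypothetical sketch of a Scrapy CrawlSpider like the one described above.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class WikiSpider(CrawlSpider):
    name = "wiki_spider"
    # Restrict crawling to the wiki domain so we don't scrape the entire internet.
    allowed_domains = ["2022.igem.wiki"]
    start_urls = ["https://2022.igem.wiki/"]

    rules = (
        # Follow only links whose URLs suggest software or modelling pages;
        # navbars mean a few URLs per team wiki are enough to reach everything.
        Rule(
            LinkExtractor(allow=(r"/software", r"/model")),
            callback="parse_page",
            follow=True,
        ),
    )

    def parse_page(self, response):
        # Hand the URL and raw page text off to the processing stage.
        yield {
            "url": response.url,
            "title": response.css("title::text").get(default=""),
            "text": " ".join(response.css("p::text").getall()),
        }
```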
In the processing stage, valuable information is extracted from both the URL and the scraped page content, then formatted and prepared for the later stages. During this phase, pages are filtered against certain criteria, including length (to catch cases where the scraper scrapes a blank page) and certain keywords (matching the default iGEM page text). This helps avoid wasting time and computational power on false-positive pages.
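A hedged sketch of this filtering step follows; the length threshold and default-page marker are assumptions for illustration, not SAMARA’s exact criteria.

```python
# Hypothetical page filter; threshold and marker text are illustrative.
MIN_LENGTH = 500  # characters; anything shorter is likely a blank page
DEFAULT_PAGE_MARKERS = [
    "This is a default page",  # placeholder text left on unedited iGEM pages
]

def is_relevant(page_text: str) -> bool:
    """Return True if a scraped page is worth passing to the summarizer."""
    if len(page_text) < MIN_LENGTH:
        return False  # blank or near-empty page
    if any(marker in page_text for marker in DEFAULT_PAGE_MARKERS):
        return False  # unedited default iGEM page (false positive)
    return True
```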
Following the processing stage is the summarization stage. The code implements the sshleifer/distilbart-cnn-12-6 model, a machine learning model trained for several natural language processing tasks, including, in our case, summarization (1). Because the model can only handle a limited number of tokens at a time, the summarizer is implemented as a function that takes a single scraped software or modelling page and breaks it into smaller sections; each section is summarized individually, and the pieces are compiled at the end to form a complete summary. The summarizer uses abstractive summarization, which employs deep learning methods to paraphrase and condense the given text (2), and which is more effective than other methods, such as extractive summarization (3).
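The chunk-and-join approach can be sketched with the Hugging Face transformers pipeline as follows; the chunk size and generation lengths are illustrative assumptions, not SAMARA’s exact settings.

```python
# Hypothetical chunked summarization with sshleifer/distilbart-cnn-12-6.
from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

def summarize_page(text: str, chunk_chars: int = 3000) -> str:
    """Split a page into sections under the model's token limit,
    summarize each section, then join the pieces into one summary."""
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    pieces = [
        summarizer(chunk, max_length=150, min_length=30, truncation=True)[0]["summary_text"]
        for chunk in chunks
    ]
    return " ".join(pieces)
```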
Finally, the scraper exports the gathered information as either a .csv file (useful for manual parsing in Excel) or a JSONLines .jl file (a more elegant solution for feeding the data into later programs).
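For reference, Scrapy’s built-in feed exports can produce both formats from a single settings entry along these lines (output paths are illustrative):

```python
# In settings.py: one FEEDS entry per output format.
FEEDS = {
    "output/pages.csv": {"format": "csv"},
    "output/pages.jl": {"format": "jsonlines"},
}
```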
The iGEMScraper files are available for download on the iGEM Calgary Software Tools Gitlab page, and the repository contains more technical information, including workflows, requirements, and other important documentation.
Though officially named as a recursive acronym, the SAMARA deployment is, at its core, a well-integrated scraper-to-database pipeline. The deployment was created with Python’s Django framework, chosen both for its popularity and for how easily it connects to Scrapy through the scrapy-djangoitem library.
The deployment’s scraper is near-identical to iGEMScraper, except that it produces no .csv or .jl files on export. Instead, the program’s integration with Django allows scraped items to be written directly into Django’s default .sqlite3 database, where the information can be read, processed, and displayed by Django without the constant delays and inefficiencies caused by importing, exporting, and converting data.
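A hedged sketch of this Scrapy-to-Django hand-off is shown below; the WikiPage model and its fields are illustrative assumptions, not SAMARA’s actual schema.

```python
# models.py (Django) -- hypothetical model for scraped pages.
from django.db import models

class WikiPage(models.Model):
    url = models.URLField(unique=True)
    title = models.CharField(max_length=200)
    summary = models.TextField()

# items.py (Scrapy) -- a DjangoItem mirrors the model's fields.
from scrapy_djangoitem import DjangoItem

class WikiPageItem(DjangoItem):
    django_model = WikiPage

# pipelines.py (Scrapy) -- saving an item writes straight to Django's
# .sqlite3 database, with no .csv/.jl intermediates.
class DjangoWriterPipeline:
    def process_item(self, item, spider):
        item.save()
        return item
```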
The completed Django app can then be deployed locally to give users a more robust searching experience. The web interface lets users search through the collected information and quickly find relevant summaries, which we hope will streamline and expedite the research conducted by future teams. The source code for the SAMARA Research Assistant is available on the iGEM Calgary Software Tools Gitlab, along with all the information needed to get started or to modify the project.
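On the Django side, the search feature can be as simple as a view that filters stored summaries against the query string; a minimal sketch, reusing the hypothetical WikiPage model from the previous example:

```python
# views.py -- hypothetical search view; template name is illustrative.
from django.shortcuts import render
from .models import WikiPage

def search(request):
    query = request.GET.get("q", "")
    # Case-insensitive substring match against the stored summaries.
    results = (
        WikiPage.objects.filter(summary__icontains=query)
        if query else WikiPage.objects.none()
    )
    return render(request, "search.html", {"query": query, "results": results})
```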
Though SAMARA is a powerful and robust program, there are areas for improvement that future iterations should consider. One such area is the relatively high number of false negatives. SAMARA is imperfect and relies on assumptions to properly parse and extract information from webpages; though iGEM wiki pages are relatively standardized, the system cannot handle pages that break these assumptions and “break the norm.” This is, unfortunately, difficult to mitigate beyond adding more conditions to cast a wider net. However, recent research on the use of AI in web scraping hopes to change this: some companies are developing machine learning models that autonomously detect relevant information, which would mitigate SAMARA’s inability to work with broken HTML or non-standard web pages.
The world of machine learning has gained significant ground in implementing natural language processing models that can accurately perform summarization. In testing several summarizers, we found the best of these to be the GPT-3 models created by OpenAI (3). Although our current model, sshleifer/distilbart-cnn-12-6, performs well, it is not as efficient or consistent as the GPT-3 models. We did not implement GPT-3 because of the monetary costs associated with it: SAMARA is designed to be an autonomous, open-source system usable by many future iGEM teams, and it would not be feasible to depend on a paid service. However, OpenAI reduces costs as time passes and eventually makes some models free of charge (4), so the summarization stage could adopt GPT-3 in the future, when prices are more sustainable or the model is released at no cost. Another way to improve SAMARA’s summarization stage would be to fine-tune our current model on iGEM-specific software and modelling pages, as sketched below.
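A sketch of what such fine-tuning might look like with Hugging Face’s Seq2SeqTrainer follows; the dataset file, field names, and hyperparameters are all illustrative assumptions.

```python
# Hypothetical fine-tuning of the current summarizer on iGEM-specific pages.
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

checkpoint = "sshleifer/distilbart-cnn-12-6"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Hypothetical JSONLines file of {"text": ..., "summary": ...} pairs built
# from hand-summarized iGEM software and modelling pages.
data = load_dataset("json", data_files={"train": "igem_pages.jl"})

def preprocess(batch):
    # Tokenize page text as inputs and reference summaries as labels.
    inputs = tokenizer(batch["text"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=150, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = data.map(preprocess, batched=True, remove_columns=["text", "summary"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="distilbart-igem", num_train_epochs=3),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```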