Active Learning Strategies for Efficient Growth of BGC Database (DSI-SRP)

This DSI-SRP fellowship funded Chaehyun “Amy” Song to work in the laboratory of Dr. Allison Walker in the Department of Pathology, Microbiology and Immunology during the summer of 2023. Amy is a rising junior majoring in Biochemistry and Chemical Biology with a minor in Computer Science.

In this project, Amy and her mentor focused on natural products: small molecules produced by living organisms that often have therapeutically relevant bioactivities, such as antibiotic activity. Dr. Walker developed machine learning models that predict a natural product’s activities directly from the sequence of its biosynthetic gene cluster (BGC). To learn this sequence-activity relationship, existing datasets of natural products with known activity were used to build the training dataset. Classifiers trained on this dataset have predicted biological activities with accuracies of up to 80%, but some classifiers were less accurate because the dataset is small.

Accordingly, this research focuses on growing the dataset strategically by adding the new BGCs that would be the most valuable to it. Active learning algorithms were used to concentrate on the regions of the data space where the model is most uncertain, and where new labels are therefore most informative. For convenience, “modAL” (an active learning framework for Python 3 built on top of scikit-learn) was used to compare different query strategies and active learning models.
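To make the idea of a query strategy concrete, the sketch below ranks unlabeled BGCs by how uncertain a classifier is about them, using three common uncertainty measures (least confidence, margin, and entropy). It is a minimal illustration in plain Python, not the project's code and not modAL's API: the function names, the BGC identifiers, and the predicted probabilities are all hypothetical.

```python
import math

def least_confidence(probs):
    """1 - highest class probability: large when no class clearly wins."""
    return 1.0 - max(probs)

def margin(probs):
    """Gap between the two most likely classes: a small gap means uncertain."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second

def entropy(probs):
    """Shannon entropy of the predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_queries(pool_probs, n_queries, strategy="entropy"):
    """Return the ids of the n_queries most uncertain items in the pool.

    pool_probs maps an item id to its predicted class probabilities.
    """
    if strategy == "least_confidence":
        score = least_confidence
    elif strategy == "margin":
        # Negate so that a smaller margin gives a higher uncertainty score.
        score = lambda p: -margin(p)
    elif strategy == "entropy":
        score = entropy
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    ranked = sorted(pool_probs, key=lambda k: score(pool_probs[k]), reverse=True)
    return ranked[:n_queries]

# Hypothetical predictions (P(active), P(inactive)) for four unlabeled BGCs.
pool_probs = {
    "BGC_A": [0.95, 0.05],
    "BGC_B": [0.55, 0.45],
    "BGC_C": [0.70, 0.30],
    "BGC_D": [0.50, 0.50],
}

# The two BGCs the model is least sure about are proposed for testing first.
selected = select_queries(pool_probs, n_queries=2, strategy="entropy")
print(selected)
```

For a two-class problem like this one, the three measures produce the same ranking; they can disagree once there are more than two activity classes, which is one reason to compare strategies empirically.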

In parallel, experimental work was undertaken with several new bacterial strains to detect secondary metabolites in a high-throughput manner. After 16S PCR and sequencing confirmed the identity of the bacterial strains, mass spectrometry was used to detect the bacteria’s natural products. This project is part of an ongoing effort in the Walker lab. In the future, the lab will use a strategy referred to as “metabologenomics” to quickly assign metabolites to the BGCs that produce them. Activity assays will then determine the activity of natural products produced by the BGCs that active learning prioritized. Finally, the lab will determine whether including these new data points in the training dataset improves the accuracy of the model’s predictions.

Ultimately, the research seeks to improve the machine learning model’s accuracy by developing a well-balanced dataset, and in doing so to contribute to the discovery of natural product antibiotics.

Graphs of Chaehyun Song's work for her DSI-SRP project

In addition to receiving support through a DSI-SRP fellowship, this project was supported and facilitated by the DSI Data Science Team through their regular summer workshops and demo sessions.