Exploring sequence landscape of biosynthetic gene clusters with protein language models

Published: 17 Jun 2024, Last Modified: 16 Jul 2024, Venue: ML4LMS Poster, License: CC BY 4.0
Keywords: biosynthetic gene cluster, natural product discovery, protein language model, transfer learning, contrastive learning, natural language processing
Abstract: Many organisms, such as bacteria, fungi, and plants, produce intricate chemicals that are not needed for their growth and reproduction and are therefore called secondary metabolites or natural products (NPs). NPs are a rich source of drugs, with most antibiotics being derivatives of NPs. In a producer organism, NPs are synthesized by a set of enzymes encoded by genes that often lie near each other on the chromosome; such a group of genes is called a biosynthetic gene cluster (BGC). In this work, we explore the capability of protein language models (PLMs) to produce meaningful representations of BGCs. We employ transfer learning to train models that predict the chemical class of the produced compound, and we explore the topological properties of the resulting embeddings. The code is available at the project's GitHub repository: https://github.com/kalininalab/NaturalPPLuM.
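The transfer-learning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, because the abstract does not describe the architecture. The sketch assumes that each protein in a BGC already has a fixed-size PLM embedding, that a BGC is represented by mean-pooling those embeddings, and that chemical class is predicted with a nearest-centroid classifier. The pooling choice, the classifier, the function names, the class labels `NRPS` and `PKS`, and the random stand-in vectors are all illustrative assumptions.

```python
import math
import random


def mean_pool(vectors):
    """Average a list of equal-length vectors into a single vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fit_centroids(bgc_vectors, labels):
    """Compute one mean vector (centroid) per chemical class."""
    grouped = {}
    for vector, label in zip(bgc_vectors, labels):
        grouped.setdefault(label, []).append(vector)
    return {label: mean_pool(vectors) for label, vectors in grouped.items()}


def predict(bgc_vector, centroids):
    """Return the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(bgc_vector, centroids[label]))


def toy_bgc(rng, center, n_proteins=5, dim=8):
    """Hypothetical stand-in for per-protein PLM embeddings of one BGC."""
    return [[rng.gauss(center, 0.1) for _ in range(dim)] for _ in range(n_proteins)]


rng = random.Random(0)

# Synthetic training set: two well-separated, made-up classes.
train = [(toy_bgc(rng, 2.0), "NRPS") for _ in range(10)]
train += [(toy_bgc(rng, -2.0), "PKS") for _ in range(10)]

# One pooled vector per BGC, then one centroid per class.
vectors = [mean_pool(proteins) for proteins, _ in train]
labels = [label for _, label in train]
centroids = fit_centroids(vectors, labels)

# Classify an unseen synthetic BGC drawn from the second class.
query = mean_pool(toy_bgc(rng, -2.0))
predicted = predict(query, centroids)
print(predicted)
```

With real data, `toy_bgc` would be replaced by embeddings from a pretrained PLM, and the nearest-centroid step by whatever trained classification head the project actually uses.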
Poster: pdf
Submission Number: 128