FusOn-pLM: A Fusion Oncoprotein-Specific Language Model via Focused Probabilistic Masking

Published: 17 Jun 2024, Last Modified: 17 Jun 2024
Venue: AccMLBio Poster
License: CC BY 4.0
Keywords: Protein Language Models, Masked Language Modeling, Cancer
TL;DR: We fine-tune ESM-2 using a focused masking strategy to generate fusion oncoprotein-specific embeddings.
Abstract: Fusion oncoproteins, a class of chimeric proteins arising from chromosomal translocations, drive and sustain various cancers, particularly those affecting children. Because of their intrinsically disordered nature, large size, and lack of well-defined, druggable pockets, they have historically been challenging to target therapeutically: neither small molecule-based methods nor structure-based approaches for binder design are strong options for this class of molecules. Recently, protein language models (pLMs) have demonstrated success at representing protein sequences with information-rich embeddings, enabling downstream design applications from sequence alone. However, no current pLM has been trained on fusion oncoprotein sequences, so existing models may not produce optimal representations for these proteins. In this work, we introduce FusOn-pLM, a novel pLM that fine-tunes ESM-2 on fusion oncoprotein sequences via masked language modeling (MLM). Specifically, we introduce a new MLM strategy that uses a binding-site probability predictor to focus masking on key amino acid residues, yielding fusion oncoprotein-aware embeddings that are better suited to these proteins than those of base ESM-2. Our model improves performance on fusion oncoprotein-specific benchmarks relative to baseline representations, including biophysical embeddings and base ESM-2 embeddings, motivating downstream use of FusOn-pLM embeddings for therapeutic design tasks targeting these fusions.
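The focused masking idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the sampling scheme, the masking rate, or the binding-site predictor, so everything below is an assumption made for illustration: the function name `focused_mask`, the 15% default rate, the `<mask>` token string, and the choice of weighted sampling without replacement. The per-residue probabilities `site_probs` are taken as given, standing in for the output of whatever predictor is used.

```python
import random

MASK_TOKEN = "<mask>"  # assumed mask token string


def focused_mask(sequence, site_probs, mask_rate=0.15, seed=None):
    """Mask roughly mask_rate of the residues, favoring high-probability sites.

    sequence   : amino acid string
    site_probs : per-residue binding-site probabilities (assumed precomputed)
    mask_rate  : fraction of residues to mask (illustrative default)
    Returns (tokens, labels); labels hold the original residue at masked
    positions and None elsewhere.
    """
    if len(sequence) != len(site_probs):
        raise ValueError("sequence and site_probs must have equal length")

    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(sequence)))

    # Weighted sampling without replacement: draw a key u ** (1 / w) for each
    # position and keep the n_mask largest keys. A small epsilon keeps the
    # exponent finite when a probability is exactly zero.
    eps = 1e-9
    keys = [(rng.random() ** (1.0 / (p + eps)), i)
            for i, p in enumerate(site_probs)]
    chosen = sorted(i for _, i in sorted(keys, reverse=True)[:n_mask])

    tokens = list(sequence)
    labels = [None] * len(sequence)
    for i in chosen:
        labels[i] = tokens[i]   # target the model must recover
        tokens[i] = MASK_TOKEN  # input seen by the model
    return tokens, labels


# Hypothetical 10-residue example where positions 5 and 6 are likely sites.
tokens, labels = focused_mask("ACDEFGHIKL",
                              [0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
                              mask_rate=0.2, seed=0)
```

Compared with uniform random masking, a scheme like this spends more of the training signal on residues the predictor considers functionally important, which is the intuition the abstract describes.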
Submission Number: 21