OPUS-GO: Unlocking Residue-level Insights from Sequence-level Annotations Using Biological Language Models

Published: 06 Mar 2025, Last Modified: 26 Apr 2025, GEM, CC BY 4.0
Track: Machine learning: computational method and/or computational results
Nature Biotechnology: Yes
Keywords: protein language model; model interpretability; protein Gene Ontology
Abstract: Accurate annotation of proteins is crucial for understanding their structural and functional properties. Existing biological language model (BLM)-based methods, however, often prioritize sequence-level classification accuracy while neglecting residue-level interpretability, because sequence-level annotations are easier to obtain. To address this, we introduce OPUS-GO, a method that improves sequence-level predictions while also providing detailed residue-level insights by pinpointing the critical residues associated with each functional label. By employing a modified Multiple Instance Learning (MIL) strategy with BLM representations, OPUS-GO outperforms baseline methods in both sequence-level and residue-level classification accuracy across various downstream tasks for protein sequences, including Gene Ontology (GO)-term prediction. Furthermore, the identified residues can serve as promising “prompts” for molecular design models, such as ESM-3, enabling the generation of sequences with the desired functionality.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Presenter: ~Ruoxi_Zhang1
Format: Yes, the presenting author will attend in person if this work is accepted to the workshop.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Submission Number: 76