Extending Prot2Token: Aligning Protein Language Models for Unified and Diverse Protein Prediction Tasks

Published: 06 Mar 2025, Last Modified: 18 Apr 2025
Venue: ICLR 2025 Workshop LMRL
License: CC BY 4.0
Track: Full Paper Track
Keywords: protein language model, large language model, protein prediction, autoregressive transformer, post-translational modifications, kinase, structure prediction, protein language processing
TL;DR: We extend Prot2Token to a wide range of protein prediction tasks
Abstract: Comprehensive protein function and property prediction remains a major challenge due to the vast diversity of sequences, structural variation, and limited labeled data. Existing models are often task-specific and require independent training, which limits scalability. To address this, we extend Prot2Token, a unified autoregressive framework that focuses on post-training alignment of pre-trained protein language models (PLMs), to new applications. Our approach casts a broad range of protein prediction tasks as next-token prediction, including protein-protein structure similarity, 3D structure prediction, mutation stability, post-translational modifications (PTMs), substrate-kinase phosphorylation sites, protein-protein affinity, and protein-ion binding sites. We introduce a self-supervised pre-training stage for the decoder, which improves model initialization and downstream predictions. By integrating a causal autoregressive transformer decoder with a pre-trained ESM-2 encoder, our model aligns diverse protein tasks within a single framework. We also discuss the opportunities and limitations of this approach, providing insights for future research on optimizing PLMs as a general tool for broader biological applications.
Attendance: Dong Xu
Submission Number: 87
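
Below is a minimal, hedged sketch of the general architecture pattern the abstract describes: a pre-trained ESM-2 encoder feeding a causal autoregressive transformer decoder that emits task-specific label tokens. This is not the authors' implementation; the model checkpoint, label vocabulary size, decoder dimensions, and the label-token scheme are illustrative assumptions.

```python
# Conceptual sketch (assumptions labeled): frozen ESM-2 encoder + small causal
# decoder trained to predict task label tokens, in the spirit of Prot2Token.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, EsmModel


class Prot2TokenSketch(nn.Module):
    def __init__(self, esm_name="facebook/esm2_t12_35M_UR50D",  # assumed checkpoint
                 label_vocab_size=512, d_model=480, n_layers=4, n_heads=8):
        super().__init__()
        self.encoder = EsmModel.from_pretrained(esm_name)   # protein sequence encoder
        for p in self.encoder.parameters():                  # keep the PLM frozen; align only the decoder
            p.requires_grad = False
        self.label_embed = nn.Embedding(label_vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)  # causal decoder over label tokens
        self.lm_head = nn.Linear(d_model, label_vocab_size)

    def forward(self, input_ids, attention_mask, label_ids):
        # Encode the amino-acid sequence with ESM-2 (cross-attention memory).
        memory = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        tgt = self.label_embed(label_ids)
        causal = nn.Transformer.generate_square_subsequent_mask(label_ids.size(1))
        hidden = self.decoder(tgt, memory, tgt_mask=causal)  # next-token prediction
        return self.lm_head(hidden)                           # logits over label tokens


# Example usage with a hypothetical label vocabulary (task token, site tokens, end token).
tok = AutoTokenizer.from_pretrained("facebook/esm2_t12_35M_UR50D")
batch = tok(["MKTAYIAKQR"], return_tensors="pt")
labels = torch.tensor([[1, 7, 7, 2]])
logits = Prot2TokenSketch()(batch["input_ids"], batch["attention_mask"], labels)
```

In this sketch, d_model is set to 480 so the decoder's cross-attention matches the hidden size of the assumed ESM-2 (t12, 35M) checkpoint; a larger encoder would require a matching or projected dimension.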