Protein as a Second Language for LLMs

ICLR 2026 Conference Submission 5183 Authors

14 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large language models; Protein–QA dataset; Context-driven learning; Zero-shot learning
TL;DR: We propose a protein–language framework and bilingual dataset that enable LLMs to reason about protein function via context-driven learning without fine-tuning.
Abstract: Deciphering the function of unseen protein sequences is a fundamental challenge with broad scientific impact, yet most existing methods depend on task-specific adapters or large-scale supervised fine-tuning. We introduce the "Protein-as-Second-Language" framework, which reformulates amino-acid sequences as sentences in a novel symbolic language that large language models can interpret through contextual exemplars. Our approach adaptively constructs sequence–question–answer triples that reveal functional cues in a zero-shot setting, without any further training. To support this process, we curate a bilingual corpus of 79,926 protein–QA instances spanning attribute prediction, descriptive understanding, and extended reasoning. Empirically, our method delivers consistent gains across diverse open-source LLMs and GPT-4o, achieving up to a 17.2% ROUGE-L improvement (+7% on average) and even surpassing fine-tuned protein-specific language models. These results highlight that generic LLMs, when guided with protein-as-language cues, can outperform domain-specialized models, offering a scalable pathway for protein understanding in foundation models.
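The core mechanism described in the abstract, supplying sequence–question–answer triples as in-context exemplars so an off-the-shelf LLM can answer questions about an unseen sequence without any weight updates, can be illustrated with a minimal sketch. The `ProteinQA` record, the `build_prompt` helper, and the toy sequences below are hypothetical illustrations of this style of prompting, not the authors' released code or data format.

```python
from dataclasses import dataclass

@dataclass
class ProteinQA:
    """One sequence-question-answer exemplar (hypothetical format)."""
    sequence: str  # amino-acid sequence, treated as a 'sentence'
    question: str
    answer: str

def build_prompt(exemplars: list[ProteinQA], query_seq: str, query_q: str) -> str:
    """Assemble a context-driven, zero-shot prompt: exemplar triples
    followed by the unseen query sequence. No model is fine-tuned;
    all signal comes from the in-context examples."""
    parts = ["Read each protein sequence as a sentence in a symbolic language."]
    for ex in exemplars:
        parts.append(f"Sequence: {ex.sequence}\nQ: {ex.question}\nA: {ex.answer}")
    # The query triple is left open for the LLM to complete.
    parts.append(f"Sequence: {query_seq}\nQ: {query_q}\nA:")
    return "\n\n".join(parts)

# Toy usage with made-up sequences and answers.
demo = [
    ProteinQA(
        sequence="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
        question="What is the likely function of this protein?",
        answer="A DNA-binding protein involved in transcriptional regulation.",
    ),
]
print(build_prompt(demo, "MSLNNGQKVLVTGASGFIGSHL",
                   "What is the likely function of this protein?"))
```

The resulting string would be sent as a single prompt to any chat-capable LLM; the paper's contribution lies in how the exemplar triples are adaptively selected, which this sketch does not attempt to reproduce.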
Primary Area: datasets and benchmarks
Submission Number: 5183