QAProt: Enabling Sequence-to-Text Protein Function Learning with a Comprehensive QA Corpus

ICLR 2026 Conference Submission13362 Authors

18 Sept 2025 (modified: 26 Jan 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Question–answer dataset, Large-scale protein dataset, Language models
TL;DR: We present QAProt, the largest and most diverse protein question–answer dataset mined from scientific literature, enabling improved sequence-to-text modeling and zero-shot functional prediction beyond structural insights.
Abstract: Inferring protein function from sequence is a grand challenge in genomics, yet progress is bottlenecked by the narrow, template-driven datasets available for training. These datasets, derived from structured databases, fail to leverage the rich diversity of knowledge in the scientific literature. To address this gap, we introduce **QAProt**, a large-scale corpus of over 987,000 free-form question–answer pairs mined directly from PubMed abstracts, capturing broader topical and linguistic variability than existing resources. To ensure high fidelity, we developed a rigorous multi-LLM cleaning pipeline that yields a 13× reduction in the estimated hallucination rate. Our analyses reveal that current protein LLMs exhibit a performance collapse when tested on the realistic distribution of taxa and functions found in QAProt, highlighting the complementary nature of our literature-derived data distribution. A single epoch of fine-tuning on our dataset yields substantial improvements, including an 86% performance gain on previously unseen protein domains. QAProt is a complementary new resource that enables the development of more powerful, generalizable models for protein science. The dataset is available anonymously at https://huggingface.co/conferenceacc/QAProt.
Primary Area: datasets and benchmarks
Submission Number: 13362