Keywords: Protein language models, Variant interpretation, Multimodal generative models, Gated cross-attention, Perceiver resampler, Mutation reasoning, Biomedical question answering, Explainable AI, Pathogenicity prediction, Mutation2TextQA
TL;DR: We introduce Mutation2Text, a multimodal protein language model that generates natural language explanations for diverse mutations, surpassing baselines in pathogenicity and functional interpretation.
Abstract: Understanding the functional consequences of protein mutations is crucial for diagnosing and preventing diseases such as cancer. However, existing protein language models (PLMs) are limited by their black-box nature, inability to process full-length proteins without truncation, primary focus on single-nucleotide variants, and lack of directional interpretation (e.g., distinguishing loss-of-function from gain-of-function). We introduce Mutation2Text, a multimodal generative PLM designed to produce human-understandable, rationale-based explanations for diverse mutations, including substitutions, insertions, deletions, and frameshifts. Mutation2Text employs a gated cross-attention mechanism to explicitly contrast wild-type and mutated proteins and uses a Perceiver Resampler for length-invariant encoding. We constructed Mutation2TextQA, the largest mutation interpretation dataset to date, comprising millions of question-answer pairs with substantial lexical and semantic diversity mined from published literature, facilitating robust generalization across mutation contexts. Mutation2Text training follows a progressive three-stage approach: (1) foundational protein function grounding with UniProt annotations; (2) comprehensive biological understanding using 10 million biomedical literature-derived QA pairs; and (3) mutation-focused reasoning leveraging Mutation2TextQA. Mutation2Text consistently outperforms baselines on pathogenicity prediction, functional classification, mutation annotation, and open-ended mutation QA tasks. Our analysis demonstrates a novel use of LLMs for variant interpretation, providing transparent, rationale-driven predictions that enhance clinical interpretability.
Primary Area: generative models
Submission Number: 9624