Keywords: Protein language models, Variant interpretation, Multimodal generative models, Gated cross-attention, Perceiver resampler, Mutation reasoning, Biomedical question answering, Explainable AI, Pathogenicity prediction, Mutation2TextQA
TL;DR: We introduce Mutation2Text, a multimodal protein language model that generates natural language explanations for diverse mutations, surpassing baselines in pathogenicity and functional interpretation.
Abstract: Understanding the functional consequences of protein mutations is crucial for diagnosing and preventing diseases such as cancer. However, existing protein language models (PLMs) are limited by their black-box nature, inability to process full-length proteins without truncation, primary focus on single-nucleotide variants, and lack of directional interpretation (e.g., distinguishing loss-of-function from gain-of-function). We introduce Mutation2Text, a multimodal generative PLM designed to produce human-understandable, rationale-based explanations for diverse mutations, including substitutions, insertions, deletions, and frameshifts. Mutation2Text employs a gated cross-attention mechanism to explicitly contrast wild-type and mutated proteins and uses a Perceiver Resampler for length-invariant encoding. We constructed Mutation2TextQA, the largest mutation interpretation dataset to date, comprising millions of question-answer pairs with substantial lexical and semantic diversity mined from published literature, facilitating robust generalization across mutation contexts. Mutation2Text training follows a progressive three-stage approach: (1) foundational protein function grounding with UniProt annotations; (2) comprehensive biological understanding using 10 million biomedical literature-derived QA pairs; and (3) mutation-focused reasoning leveraging Mutation2TextQA. Mutation2Text consistently outperforms baselines on pathogenicity prediction, functional classification, mutation annotation, and open-ended mutation QA tasks. Our analysis demonstrates a novel use of LLMs for variant interpretation, providing transparent, rationale-driven predictions that enhance clinical interpretability.
Primary Area: generative models
Submission Number: 9624