Evolutionary Profiles for Protein Fitness Prediction

17 Sept 2025 (modified: 11 Feb 2026), Submitted to ICLR 2026, CC BY 4.0
Keywords: Protein Fitness Prediction
TL;DR: Motivated by an inverse reinforcement learning (IRL) view of protein evolution, EvoIF achieves state-of-the-art protein fitness prediction on ProteinGym by combining evolutionary profiles from retrieved homologs with inverse folding, using only 0.15% of the training data.
Abstract: Predicting the fitness impact of mutations is central to protein engineering but constrained by limited assays relative to the size of sequence space. Protein language models (pLMs) trained with masked language modeling (MLM) exhibit strong zero-shot fitness prediction; we provide a unifying view by interpreting natural evolution as implicit reward maximization and MLM as inverse reinforcement learning (IRL), in which extant sequences act as expert demonstrations and pLM log-odds serve as fitness estimates. Building on this perspective, we introduce EvoIF, a lightweight model that integrates two complementary sources of evolutionary signal: (i) within-family profiles from retrieved homologs and (ii) cross-family structural–evolutionary constraints distilled from inverse folding logits. EvoIF fuses sequence–structure representations with these profiles via a compact transition block, yielding calibrated probabilities for log-odds scoring. On ProteinGym (217 mutational assays; >2.5M mutants), EvoIF and its MSA-enabled variant achieve state-of-the-art or competitive performance while using only 0.15% of the training data and fewer parameters than recent large models. Ablations confirm that within-family and cross-family profiles are complementary, improving robustness across function types, MSA depths, taxa, and mutation depths.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 9199
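
The log-odds scoring mentioned in the abstract can be illustrated with a toy example. The sketch below is not EvoIF: it only builds a within-family profile from a handful of aligned homologs and scores a mutant as the log-probability of the mutant residue minus that of the wild-type residue. The function names (`build_profile`, `log_odds_score`), the pseudocount, and the four-sequence alignment are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_homologs, pseudocount=1.0):
    """Per-position amino-acid probabilities from equal-length aligned sequences.

    Hypothetical helper: a smoothed frequency table, not the paper's profile.
    """
    length = len(aligned_homologs[0])
    if any(len(seq) != length for seq in aligned_homologs):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for pos in range(length):
        # Count only standard residues; gaps and unknown symbols are skipped.
        counts = Counter(
            seq[pos] for seq in aligned_homologs if seq[pos] in AMINO_ACIDS
        )
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def log_odds_score(profile, wild_type, mutations):
    """Sum over mutated sites of log p(mutant) - log p(wild type).

    `mutations` is a list of (0-indexed position, mutant residue) pairs.
    """
    score = 0.0
    for pos, mut_aa in mutations:
        wt_aa = wild_type[pos]
        score += math.log(profile[pos][mut_aa]) - math.log(profile[pos][wt_aa])
    return score


# Toy alignment (assumed data): position 2 is mostly V, occasionally I.
homologs = ["MKVL", "MKIL", "MRVL", "MKVL"]
wild_type = "MKVL"
profile = build_profile(homologs)

conservative = log_odds_score(profile, wild_type, [(2, "I")])  # V -> I, seen in homologs
disruptive = log_odds_score(profile, wild_type, [(2, "W")])    # V -> W, never observed
print(conservative, disruptive)
```

In this toy setting the substitution observed among the homologs receives a higher (less negative) score than the unobserved one, which is the qualitative behavior a log-odds fitness estimate is meant to capture. A learned model would replace the frequency table with predicted probabilities.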