<h3>Abstract</h3> <p>Protein language models have become essential tools for engineering novel functional proteins. The emerging paradigm of family-based language models makes use of homologous sequences to steer protein design and enhance zero-shot fitness prediction, by imbuing models with the ability to explicitly reason over evolutionary context. To provide an open foundation for this modelling approach, we introduce <i>ProFam-1</i>, a 251M-parameter autoregressive protein family language model (pfLM) trained with next-token prediction on millions of protein families represented as concatenated, unaligned sets of sequences. <i>ProFam-1</i> is competitive with state-of-the-art models on the ProteinGym zero-shot fitness prediction benchmark, achieving Spearman correlations of 0.47 for substitutions and 0.53 for indels. For homology-guided generation, <i>ProFam-1</i> generates diverse sequences with predicted structural similarity, while preserving residue conservation and covariance patterns. All of ProFam’s training and inference pipelines, together with our curated, large-scale training dataset <i>ProFam Atlas</i>, are released fully open source, lowering the barrier to future method development.</p>
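<p>To make the family-based paradigm described above concrete, the sketch below shows one plausible way to serialize a set of unaligned homologs into a single token stream and score a query variant by its conditional log-likelihood under an autoregressive model. The separator token, the <code>build_family_prompt</code> and <code>score_variant</code> helpers, and the uniform dummy scorer are illustrative assumptions, not <i>ProFam-1</i>'s actual tokenizer or interface; a trained pfLM would supply a learned next-token distribution in place of the placeholder.</p>

```python
import math

SEP = "|"  # assumed family-separator token; the real tokenizer may differ

def build_family_prompt(homologs):
    """Concatenate unaligned homologous sequences into one token stream."""
    return SEP.join(homologs) + SEP

def dummy_next_token_logprob(context, token):
    # Placeholder for a trained pfLM: returns a uniform log-probability
    # over the 20 amino acids plus the separator token.
    vocab = "ACDEFGHIKLMNPQRSTVWY" + SEP
    return math.log(1.0 / len(vocab))

def score_variant(homologs, variant):
    """Zero-shot fitness proxy (illustrative): sum of conditional
    log-probs of the variant's residues given the family context."""
    context = build_family_prompt(homologs)
    total = 0.0
    for residue in variant:
        total += dummy_next_token_logprob(context, residue)
        context += residue  # autoregressive: condition on emitted residues
    return total

family = ["MKTAYIAKQR", "MKTAYVAKQR", "MKSAYIAKQR"]
print(score_variant(family, "MKTAYIAKQR"))
```

<p>Under this framing, ranking variants by <code>score_variant</code> with a real model corresponds to zero-shot fitness prediction: variants more consistent with the evolutionary context of the family receive higher likelihood.</p>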
DOI: 10.64898/2025.12.19.695431