<h3>Abstract</h3> <p>Protein language models have become essential tools for engineering novel functional proteins. The emerging paradigm of family-based language models makes use of homologous sequences to steer protein design and enhance zero-shot fitness prediction, by imbuing models with the ability to explicitly reason over evolutionary context. To provide an open foundation for this modelling approach, we introduce <i>ProFam-1</i>, a 251M-parameter autoregressive protein family language model (pfLM) trained with next-token prediction on millions of protein families represented as concatenated, unaligned sets of sequences. <i>ProFam-1</i> is competitive with state-of-the-art models on the ProteinGym zero-shot fitness prediction benchmark, achieving Spearman correlations of 0.47 for substitutions and 0.53 for indels. For homology-guided generation, <i>ProFam-1</i> generates diverse sequences with predicted structural similarity, while preserving residue conservation and covariance patterns. All of ProFam’s training and inference pipelines, together with our curated, large-scale training dataset <i>ProFam Atlas</i>, are released fully open source, lowering the barrier to future method development.</p>
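<p>To make the family-based paradigm described above concrete, the sketch below shows one plausible way to serialize a set of unaligned homologs into a single token stream and score a query variant by its conditional log-likelihood under an autoregressive model. The separator token, the <code>build_family_prompt</code> and <code>score_variant</code> helpers, and the uniform dummy scorer are illustrative assumptions, not <i>ProFam-1</i>'s actual tokenizer or interface; a trained pfLM would supply a learned next-token distribution in place of the placeholder.</p>

```python
import math

SEP = "|"  # assumed family-separator token; the real tokenizer may differ

def build_family_prompt(homologs):
    """Concatenate unaligned homologous sequences into one token stream."""
    return SEP.join(homologs) + SEP

def dummy_next_token_logprob(context, token):
    # Placeholder for a trained pfLM: returns a uniform log-probability
    # over the 20 amino acids plus the separator token.
    vocab = "ACDEFGHIKLMNPQRSTVWY" + SEP
    return math.log(1.0 / len(vocab))

def score_variant(homologs, variant):
    """Zero-shot fitness proxy (illustrative): sum of conditional
    log-probs of the variant's residues given the family context."""
    context = build_family_prompt(homologs)
    total = 0.0
    for residue in variant:
        total += dummy_next_token_logprob(context, residue)
        context += residue  # autoregressive: condition on emitted residues
    return total

family = ["MKTAYIAKQR", "MKTAYVAKQR", "MKSAYIAKQR"]
print(score_variant(family, "MKTAYIAKQR"))
```

<p>Under this framing, ranking variants by <code>score_variant</code> with a real model corresponds to zero-shot fitness prediction: variants more consistent with the evolutionary context of the family receive higher likelihood.</p>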
DOI: 10.64898/2025.12.19.695431