Likelihood-based Finetuning of Protein Language Models for Few-shot Fitness Prediction and Design

21 Jan 2025 (modified: 18 Jun 2025) · Submitted to ICML 2025 · CC BY 4.0
Abstract: Protein language models (PLMs) implicitly learn distributional constraints on protein sequences upheld over the course of evolution. As a consequence, the sequence- and mutation-level likelihoods of such models form effective zero-shot predictors of the fitness effects of mutations. Although various schemes have been proposed for exploiting the distributional knowledge captured by PLMs to enhance supervised fitness prediction and sequence design, the lack of head-to-head comparisons across prediction strategies and classes of PLM has made it difficult to identify the best-performing methods. Our contribution is to extend previously proposed ranking-based loss functions into likelihood scoring functions for *family-based* and *masked* PLMs. We demonstrate that in the low-data setting the best configurations outperform the current SOTA approach based on frozen embeddings. Furthermore, we propose ensembling strategies that exploit the strong dependence of the mutational distributions learned by PLMs on sequence context, showing that they can be used to guide efficient optimisation over fitness landscapes.
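The two ingredients named in the abstract, masked-PLM likelihood scores for mutations and a ranking-based loss for finetuning on a few labelled variants, can be illustrated with a minimal sketch. The snippet below assumes the `fair-esm` package and a Bradley–Terry-style pairwise ranking loss; the exact scoring functions, loss formulation, and models used in the paper may differ.

```python
# Hypothetical sketch: masked-marginal mutation scoring with an ESM-2 masked PLM,
# plus a pairwise ranking loss for few-shot finetuning on measured fitness values.
import torch
import esm

model, alphabet = esm.pretrained.esm2_t12_35M_UR50D()
batch_converter = alphabet.get_batch_converter()


def masked_marginal_scores(wt_seq, mutations):
    """Score each (pos, wt_aa, mut_aa) mutation as log p(mut) - log p(wt)
    at the masked position (pos is 0-indexed into wt_seq)."""
    _, _, tokens = batch_converter([("wt", wt_seq)])
    scores = []
    for pos, wt_aa, mut_aa in mutations:
        masked = tokens.clone()
        masked[0, pos + 1] = alphabet.mask_idx  # +1 to skip the BOS token
        logits = model(masked)["logits"]
        log_probs = torch.log_softmax(logits[0, pos + 1], dim=-1)
        scores.append(log_probs[alphabet.get_idx(mut_aa)]
                      - log_probs[alphabet.get_idx(wt_aa)])
    return torch.stack(scores)


def pairwise_ranking_loss(pred_scores, fitness):
    """Bradley–Terry-style loss: penalise variant pairs whose predicted
    ordering disagrees with their measured fitness ordering."""
    diff_pred = pred_scores.unsqueeze(0) - pred_scores.unsqueeze(1)
    diff_true = fitness.unsqueeze(0) - fitness.unsqueeze(1)
    mask = diff_true > 0
    return torch.nn.functional.softplus(-diff_pred[mask]).mean()


# Usage sketch: finetune the PLM so its likelihood scores rank a handful of
# labelled variants correctly (names and hyperparameters are illustrative).
wt_seq = "MKTAYIAKQR"
mutations = [(0, "M", "A"), (3, "A", "V"), (7, "K", "E")]
fitness = torch.tensor([0.1, 0.8, 0.4])

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
for _ in range(10):
    optimizer.zero_grad()
    loss = pairwise_ranking_loss(masked_marginal_scores(wt_seq, mutations), fitness)
    loss.backward()
    optimizer.step()
```

Because the loss depends only on relative orderings, it is one natural way to adapt likelihoods learned under an evolutionary objective to a specific assay scale; this sketch is illustrative and not the paper's exact method.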
Primary Area: Applications->Everything Else
Keywords: Protein Language Models, Low-data Finetuning, Protein Design
Submission Number: 4685