Residue-level text conditioning for protein language model mutation effect prediction

Published: 06 Mar 2025, Last Modified: 26 Apr 2025
Venue: GEM
License: CC BY 4.0
Track: Machine learning: computational method and/or computational results
Nature Biotechnology: Yes
Keywords: text-conditioning, protein engineering, mutation effect prediction, protein language model, model fusion
TL;DR: We introduce a general-purpose residue-level text-conditioning method for protein language models and demonstrate its capabilities on the ProteinGym mutation effect prediction benchmark.
Abstract: To augment protein sequence models with language, we introduce Conditioning on Residue-level Annotations from TExt (CRATE), a fine-tuning method that fuses two models using feature-wise linear modulation. We fine-tune protein language models at scale, first constructing a dataset (CRATE-train) that joins annotations from InterPro and UniProtKB with sequences from UniRef90, yielding approximately 105 million sequences, each with at least three annotations and nearly 100% sequence coverage on average. Applying CRATE to mutation effect prediction improves performance on the ProteinGym benchmark over prior baselines. Leveraging these improvements, we show that CRATE can be used to select the annotations with the largest positive impact on mutation effect prediction and to estimate deep mutational scanning (DMS) scores across multiple assay selection types.
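The abstract describes fusing per-residue protein language model features with text-annotation features via feature-wise linear modulation (FiLM). Below is a minimal, hypothetical sketch of such a residue-level FiLM layer; the module name, dimensions, and tensor shapes are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class ResidueFiLM(nn.Module):
    """Hypothetical sketch: modulate per-residue protein LM embeddings
    with per-residue text/annotation embeddings via FiLM (scale and shift)."""

    def __init__(self, protein_dim: int, text_dim: int):
        super().__init__()
        # Project annotation features to a per-residue scale (gamma) and shift (beta).
        self.to_gamma = nn.Linear(text_dim, protein_dim)
        self.to_beta = nn.Linear(text_dim, protein_dim)

    def forward(self, protein_repr: torch.Tensor, text_repr: torch.Tensor) -> torch.Tensor:
        # protein_repr: (batch, seq_len, protein_dim) per-residue protein LM embeddings
        # text_repr:    (batch, seq_len, text_dim)    per-residue annotation embeddings
        gamma = self.to_gamma(text_repr)
        beta = self.to_beta(text_repr)
        # FiLM: element-wise scale and shift of the protein representation.
        return gamma * protein_repr + beta


# Example usage with dummy tensors (dimensions are placeholders).
film = ResidueFiLM(protein_dim=1280, text_dim=768)
protein_repr = torch.randn(2, 100, 1280)
text_repr = torch.randn(2, 100, 768)
conditioned = film(protein_repr, text_repr)  # shape: (2, 100, 1280)
```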
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Presenter: ~Dan_Berenberg1
Format: Maybe: the presenting author will attend in person, contingent on other factors that still need to be determined (e.g., visa, funding).
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Submission Number: 57