Interpreting and Steering Protein Language Models through Sparse Autoencoders

Published: 06 Mar 2025, Last Modified: 26 Apr 2025, GEM, CC BY 4.0
Track: Machine learning: computational method and/or computational results
Nature Biotechnology: Yes
Keywords: proteins, interpretability, sparse autoencoders
TL;DR: We applied sparse autoencoders to protein language models and successfully steered the model to generate sequences with a specific domain.
Abstract: Rapid advances in transformer-based language models have revolutionized natural language processing, yet understanding the internal mechanisms of these models remains a significant challenge. This paper explores the application of sparse autoencoders (SAEs) to interpret the internal representations of protein language models, focusing on the 8M-parameter ESM-2 model. By statistically analyzing each latent component's relevance to distinct protein annotations, we identify potential interpretations linked to various protein characteristics, including transmembrane regions, binding sites, and specialized motifs. We then use these insights to guide sequence generation, shortlisting the latent components that can steer the model toward desired targets such as zinc finger domains. This work contributes to the emerging field of mechanistic interpretability in biological sequence models, offering new perspectives on model steering for sequence design.
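As an illustration of the general idea behind SAE-based steering, the following is a minimal sketch and not the authors' implementation. It assumes a standard single-layer ReLU sparse autoencoder applied to per-residue activations. The weights are random stand-ins for a trained SAE, and the dimensions, the function names (sae_encode, sae_decode, steer), and the chosen latent index are all hypothetical. Steering here means clamping one latent to a chosen value and adding back the SAE's reconstruction error, so that only the direction of the chosen latent changes.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """ReLU encoder: (n_tokens, d_model) activations -> (n_tokens, d_latent) latents."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def sae_decode(z, W_dec, b_dec):
    """Linear decoder: latents back to model activation space."""
    return z @ W_dec + b_dec

def steer(x, W_enc, b_enc, W_dec, b_dec, latent_idx, value):
    """Clamp one latent to `value`, keeping the SAE's reconstruction error intact."""
    z = sae_encode(x, W_enc, b_enc)
    error = x - sae_decode(z, W_dec, b_dec)  # part of x the SAE does not capture
    z_steered = z.copy()
    z_steered[:, latent_idx] = value
    return sae_decode(z_steered, W_dec, b_dec) + error

# Toy sizes and random weights (assumptions for illustration only).
rng = np.random.default_rng(0)
n_tokens, d_model, d_latent = 5, 16, 64
W_enc = rng.normal(size=(d_model, d_latent))
b_enc = np.zeros(d_latent)
W_dec = rng.normal(size=(d_latent, d_model))
b_dec = np.zeros(d_model)

x = rng.normal(size=(n_tokens, d_model))   # stand-in for per-residue activations
k, value = 3, 10.0                          # hypothetical latent index and clamp value
z = sae_encode(x, W_enc, b_enc)
steered = steer(x, W_enc, b_enc, W_dec, b_dec, k, value)
```

Under these assumptions, the steered activations differ from the originals only along decoder row k, scaled by how far each token's latent was moved. In a real setting the steered activations would be passed on through the remaining model layers before sampling a sequence.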
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Presenter: ~Edith_Natalia_Villegas_Garcia1
Format: Yes, the presenting author will attend in person if this work is accepted to the workshop.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Submission Number: 40