From Mechanistic Interpretability to Mechanistic Biology: Training, Evaluating, and Interpreting Sparse Autoencoders on Protein Language Models
TL;DR: We train and evaluate SAEs to identify interpretable features in pLMs and show their potential for scientific discovery.
Abstract: Protein language models (pLMs) are powerful predictors of protein structure and function, learning through unsupervised training on millions of protein sequences. pLMs are thought to capture common motifs in protein sequences, but the specifics of pLM features are not well understood. Identifying these features would not only shed light on how pLMs work, but also potentially uncover novel protein biology: studying the model to study the biology. Motivated by this, we train sparse autoencoders (SAEs) on the residual stream of a pLM, ESM-2. By characterizing SAE features, we determine that pLMs use a combination of generic features and family-specific features to represent a protein. In addition, we demonstrate how known sequence determinants of properties such as thermostability and subcellular localization can be identified by linear probing of SAE features. For predictive features without known functional associations, we hypothesize their role in unknown mechanisms and provide visualization tools to aid their interpretation. Our study provides a better understanding of the limitations of pLMs and demonstrates how SAE features can be used to help generate hypotheses for biological mechanisms. We release our code, model weights, and feature visualizer.
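The two techniques in the abstract can be illustrated with a minimal sketch: a one-layer sparse autoencoder trained on residual-stream activations, followed by a linear probe on the learned features. This is not the paper's implementation (see the linked repository for that); the dimensions, L1 coefficient, and random tensors standing in for per-residue ESM-2 activations and property labels are all assumptions for illustration.

```python
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression

class SparseAutoencoder(nn.Module):
    """Overcomplete dictionary with a ReLU bottleneck; sparsity comes
    from an L1 penalty on the feature activations during training."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))  # sparse feature activations
        return self.decoder(feats), feats

# Assumed dimensions: a residual-stream width of 480 (as in small ESM-2
# variants) and an 8x feature expansion. Both are illustrative choices.
d_model, d_hidden = 480, 3840
sae = SparseAutoencoder(d_model, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coef = 1e-3  # assumed sparsity strength

# Random activations stand in for per-residue ESM-2 residual-stream states.
acts = torch.randn(1024, d_model)
for _ in range(10):
    recon, feats = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coef * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Linear probing: fit a logistic regression from SAE features to a
# binary protein property (random labels here stand in for, e.g.,
# thermostable vs. not). Large probe weights point at candidate features.
labels = (torch.rand(1024) > 0.5).long().numpy()
probe = LogisticRegression(max_iter=1000)
probe.fit(feats.detach().numpy(), labels)
```

In this setup, interpreting a feature means inspecting which residues and which proteins activate it most strongly, which is what the paper's visualizer supports.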
Lay Summary: Protein language models (pLMs) are models trained on large amounts of protein data. They are effective at making predictions about protein structure and function, but how they do so is unclear: which patterns do they rely on?
To explore this, we used a method called sparse autoencoders (SAEs) to simplify and understand the complex features learned by ESM-2, a popular pLM. We showed that some patterns discovered by SAEs match known protein characteristics. The models rely on a mix of general patterns common to many proteins and specialized patterns unique to certain families of proteins. For other patterns whose roles aren’t yet known, we developed visualization tools to help researchers interpret them and form new scientific hypotheses.
Our work helps clarify the strengths and weaknesses of protein language models and shows how studying their internal features can lead to new biological discoveries.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/etowahadams/interprot
Primary Area: Applications->Health / Medicine
Keywords: protein language model, sparse autoencoder, mechanistic interpretability, scientific discovery
Submission Number: 13328