Mechanistic Interpretability of Antibody Language Models Using SAEs

Rebonto Haque; Oliver M. Turnbull; Anisha Parsan; Nithin Parsan; John Jingxuan Yang; Charlotte Deane

Published: 02 Mar 2026, Last Modified: 26 May 2026

Venue: GEM 2026

License: CC BY 4.0

Keywords: mechanistic interpretability, antibody language models, drug discovery

TL;DR: TopK SAEs can be used for the mechanistic interpretability of antibody language models and Ordered SAEs can be used to steer their generation

Abstract: Sparse autoencoders (SAEs) are a mechanistic interpretability technique that has been used to provide insight into learned concepts within large protein language models. Here, we employ TopK and Ordered SAEs to investigate autoregressive antibody language models and to steer their generation. We show that TopK SAEs can reveal biologically meaningful latent features, but high feature–concept correlation does not guarantee causal control over generation. In contrast, Ordered SAEs impose a hierarchical structure that reliably identifies steerable features, but at the expense of more complex and less interpretable activation patterns. These findings advance the mechanistic interpretability of domain-specific protein language models and suggest that, while TopK SAEs suffice for mapping latent features to concepts, Ordered SAEs are preferable when precise generative steering is required.
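
For readers unfamiliar with the TopK constraint named in the abstract, the following is a minimal sketch of one forward pass of a generic TopK sparse autoencoder on a single activation vector. It is not the authors' implementation: the function name, the dimensions, the random weights, the subtraction of the decoder bias before encoding, and the ReLU on surviving latents are all assumptions chosen for illustration.

```python
import random


def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Forward pass of a generic TopK SAE (illustrative, hypothetical helper).

    x      : list of d floats (one model activation vector)
    W_enc  : n_latents rows of d floats (encoder weights)
    W_dec  : n_latents rows of d floats (decoder directions)
    b_enc  : n_latents floats, b_dec : d floats
    """
    # Encoder pre-activations, one score per latent.
    # Assumption: the decoder bias is subtracted from the input first.
    pre = [
        sum(w * (xi - bd) for w, xi, bd in zip(row, x, b_dec)) + be
        for row, be in zip(W_enc, b_enc)
    ]
    # TopK step: keep the k largest pre-activations, zero everything else.
    keep = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k]
    latents = [0.0] * len(pre)
    for i in keep:
        latents[i] = max(pre[i], 0.0)  # assumption: ReLU on surviving entries
    # Decoder: the reconstruction is a sparse sum of decoder directions.
    recon = list(b_dec)
    for i in keep:
        for j in range(len(recon)):
            recon[j] += latents[i] * W_dec[i][j]
    return latents, recon


# Toy usage with random weights (all sizes are arbitrary).
random.seed(0)
d, n_latents, k = 8, 32, 4
W_enc = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(n_latents)]
W_dec = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(n_latents)]
b_enc = [0.0] * n_latents
b_dec = [0.0] * d
x = [random.gauss(0, 1) for _ in range(d)]
latents, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
```

Because at most k latents are non-zero for any input, each one can be inspected against a candidate concept. As the abstract notes, a latent that correlates with a concept in this way is not guaranteed to give causal control over generation.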

Submission Number: 65
