Keywords: Protein language models, Mechanistic interpretability, Sparse autoencoders, Enzyme design
TL;DR: A sparse autoencoder-based mechanistic interpretability framework for steering protein language models in protein engineering
Abstract: Protein Language Models (pLMs) have proven to be versatile tools for protein design, but their internal workings remain difficult to interpret. Here, we implement a mechanistic interpretability framework and apply it in two scenarios. First, by training sparse autoencoders (SAEs) on the model's activations, we identify and annotate features relevant to enzyme variant activity through a two-stage process of candidate selection followed by causal intervention. During sequence generation, we steer the model by clamping or ablating key SAE features, which increases predicted enzyme activity. Additionally, we implement an intervention strategy, \textit{MSA-steering}, which projects SAE latents onto the multiple sequence alignment dimensions of our case-study enzyme. Second, we compare pLM checkpoints before and after three rounds of Reinforcement Learning (RL) by examining sequence regions with high divergence in per-token log-likelihood, identifying the residues that most align with higher predicted affinities. Overall, we present a strategy for applying SAEs to protein engineering.
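To make the steering step concrete, below is a minimal sketch of clamping or ablating a single SAE feature in a pLM's residual-stream activations. It assumes a PyTorch-style trained SAE exposing `encode()`/`decode()` methods; the function and parameter names (`steer_activations`, `clamp_value`, `feature_idx`) are hypothetical, not the paper's actual code.

```python
import torch

def steer_activations(sae, activations, feature_idx, clamp_value=None):
    """Clamp or ablate a single SAE latent, then map back to activation space.

    sae          -- trained sparse autoencoder exposing encode()/decode() (assumed API)
    activations  -- pLM hidden states at one layer, shape (num_tokens, d_model)
    feature_idx  -- index of the SAE feature to intervene on
    clamp_value  -- fixed value to clamp the feature to; None ablates it (sets it to 0)
    """
    latents = sae.encode(activations)            # sparse codes, (num_tokens, d_sae)
    reconstruction = sae.decode(latents)
    # Keep the SAE's reconstruction error so the intervention changes
    # only the targeted feature, not the rest of the activation.
    error = activations - reconstruction
    latents[:, feature_idx] = 0.0 if clamp_value is None else clamp_value
    return sae.decode(latents) + error
```

The steered activation would replace the original hidden state at that layer during generation, so downstream layers see the intervention while all other directions in activation space are preserved.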
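The checkpoint comparison in the second scenario can likewise be sketched as a per-token log-likelihood divergence. The snippet below assumes a Hugging Face-style causal-LM interface (a `.logits` field of shape `(batch, seq_len, vocab)`); masked-LM scoring would need a different scoring loop, and all names here are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def per_token_loglik(model, tokens):
    """Log-likelihood of each token of a sequence under a causal pLM.

    tokens -- LongTensor of token ids, shape (seq_len,)
    """
    logits = model(tokens.unsqueeze(0)).logits[0]    # (seq_len, vocab)
    logp = F.log_softmax(logits[:-1], dim=-1)        # distribution over next token
    return logp.gather(-1, tokens[1:].unsqueeze(-1)).squeeze(-1)

def divergent_positions(model_before, model_after, tokens, top_k=10):
    """Rank residues by |delta per-token log-likelihood| between two checkpoints."""
    delta = per_token_loglik(model_after, tokens) - per_token_loglik(model_before, tokens)
    return delta.abs().topk(top_k).indices + 1       # +1: deltas score tokens[1:]
```

Positions where the post-RL checkpoint assigns sharply different log-likelihoods than the pre-RL one flag the residues most reshaped by training, which the abstract links to higher predicted affinities.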
Submission Number: 165