TL;DR: We evaluate sparse autoencoders for probing and find that they underperform baselines; our results raise questions about the effectiveness of current SAEs.
Abstract: Sparse autoencoders (SAEs) are a popular method for interpreting concepts represented in large language model (LLM) activations. However, evidence for the validity of their interpretations is limited because there is no ground truth for the concepts an LLM uses, and a growing number of works have identified problems with current SAEs. One alternative source of evidence would be demonstrating that SAEs improve performance on downstream tasks beyond existing baselines. We test this by applying SAEs to the real-world task of LLM activation probing in four regimes: data scarcity, class imbalance, label noise, and covariate shift. Because detecting concepts is difficult in these challenging settings, we hypothesize that SAEs' basis of interpretable, concept-level latents should provide a useful inductive bias. However, although SAEs occasionally perform better than baselines on individual datasets, we are unable to design ensemble methods that combine SAEs with baselines and consistently outperform ensembles using baselines alone. Additionally, although SAEs initially appear promising for identifying spurious correlations, detecting poor dataset quality, and training multi-token probes, simple non-SAE baselines achieve similar results. Though we cannot discount SAEs' utility on other tasks, our findings highlight the shortcomings of current SAEs and the need to rigorously evaluate interpretability methods on downstream tasks with strong baselines.
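To make the probing setup concrete, here is a minimal sketch (not the authors' code) of the comparison the abstract describes: a linear probe trained on raw LLM activations versus one trained on SAE latents. The synthetic activations, labels, and SAE encoder weights below are placeholders; in practice the activations would come from an LLM forward pass and the SAE from a pretrained checkpoint.

```python
# Minimal sketch, assuming precomputed activations and a standard ReLU SAE encoder.
# All data and weights below are synthetic stand-ins, not the paper's pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_examples, d_model, d_sae = 2000, 768, 4096  # hypothetical sizes

# Stand-in for residual-stream activations and binary probe labels.
acts = rng.normal(size=(n_examples, d_model)).astype(np.float32)
labels = (acts[:, 0] + 0.1 * rng.normal(size=n_examples) > 0).astype(int)

# Stand-in for a pretrained SAE encoder: latents = ReLU(acts @ W_enc + b_enc).
W_enc = rng.normal(scale=d_model ** -0.5, size=(d_model, d_sae)).astype(np.float32)
b_enc = np.zeros(d_sae, dtype=np.float32)
latents = np.maximum(acts @ W_enc + b_enc, 0.0)

X_tr_a, X_te_a, X_tr_s, X_te_s, y_tr, y_te = train_test_split(
    acts, latents, labels, test_size=0.5, random_state=0
)

# Baseline: logistic-regression probe on raw activations.
baseline_probe = LogisticRegression(max_iter=1000).fit(X_tr_a, y_tr)
# SAE probe: same probe class on the sparse latent features (L1 to exploit sparsity).
sae_probe = LogisticRegression(penalty="l1", solver="liblinear").fit(X_tr_s, y_tr)

print("baseline probe accuracy:", baseline_probe.score(X_te_a, y_te))
print("SAE-latent probe accuracy:", sae_probe.score(X_te_s, y_te))
```

The paper's ensembles combine probes of both kinds; this sketch only illustrates the basic feature-set swap underlying that comparison.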
Lay Summary: Large language models (LLMs) can do impressive things, but even though we have created them, we really have no idea how they work. Sparse autoencoders (SAEs) are a recent tool that tries to understand LLMs by turning the LLM's internal state (an uninterpretable list of numbers) into a human-interpretable set of concepts the LLM is currently "thinking about". For example, when the LLM is given input that says "the quick brown fox jumps over a lazy dog", the SAE might return that the LLM is thinking about animals, colors, movements, pangrams, poetry, and whimsy.
However, it's not clear whether SAEs are actually useful. Maybe they're just telling us things we already know, or could have guessed from the original internal state. To test this, we looked at probing, a task where we want to predict something ("is the LLM telling the truth," or "does the input contain a dangerous instruction") using the LLM's internal state. We find that even in difficult probing settings (e.g. where we don't have many LLM internal states available, or where the labels of our probe training data are very noisy), the SAE doesn't seem to help. Even in cases where you might expect an SAE to shine, like when a probe on the internal states might latch on to something spurious, we find ways to train comparably good probes without an SAE.
Our results highlight the shortcomings of SAEs and imply that we should evaluate our interpretability methods more rigorously.
Link To Code: https://github.com/JoshEngels/SAE-Probing
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: Mechanistic Interpretability, Sparse Autoencoders, Probing
Submission Number: 5510