Probing by Analogy: Decomposing Probes into Activations for Better Interpretability and Inter-Model Generalization
Keywords: Probing, Sparse Autoencoders
TL;DR: We decompose linear probes into linear combinations of training activations and study how they generalise between models
Abstract: Linear probes have been used to demonstrate that LLM activations linearly encode high-level properties of the input, such as truthfulness, and that these directions can evolve significantly during training and fine-tuning. However, despite their apparent simplicity, linear probes can have complex geometric interpretations, exploit spurious correlations, and lack selectivity. We present a method for decomposing linear probe directions into weighted sums of as few as 10 model activations, while maintaining task performance. These probes are also invariant to affine transformations of the representation space, and we demonstrate that, in some cases, poor base-to-fine-tune probe generalization is partially due to simple transformations of representation subspaces, and that the structure of the representation space changes less than indicated by other methods.
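To make the core idea concrete, the sketch below illustrates one plausible way to decompose a trained linear probe direction into a sparse weighted sum of training activations. It is not the authors' implementation: the synthetic data, the use of logistic regression for the probe, and the use of Lasso to obtain the sparse decomposition are all assumptions for illustration.

```python
# Hypothetical sketch: decompose a linear probe direction w into a sparse
# combination of training activations, w ~= X.T @ alpha.
import numpy as np
from sklearn.linear_model import LogisticRegression, Lasso

rng = np.random.default_rng(0)

# Assumed setup: activations X (n_samples x d) and binary labels y.
n, d = 2000, 512
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.1 * rng.normal(size=n) > 0).astype(int)

# 1) Train a standard linear probe on the activations.
probe = LogisticRegression(max_iter=1000).fit(X, y)
w = probe.coef_.ravel()  # probe direction in activation space, shape (d,)

# 2) Express the probe direction as a sparse weighted sum of training
#    activations: most entries of alpha are driven to zero by the L1 penalty.
decomp = Lasso(alpha=1e-3, fit_intercept=False, max_iter=10_000).fit(X.T, w)
alpha = decomp.coef_                      # one coefficient per training activation
support = np.flatnonzero(alpha)           # indices of activations actually used
w_hat = X.T @ alpha                       # reconstructed probe direction

cosine = float(w @ w_hat / (np.linalg.norm(w) * np.linalg.norm(w_hat)))
acc = ((X @ w_hat > 0).astype(int) == y).mean()
print(f"activations used: {support.size}, cosine(w, w_hat): {cosine:.3f}, accuracy: {acc:.3f}")
```

Because the decomposed probe is specified by coefficients over data points rather than raw coordinates, re-expressing it in a transformed representation space only requires mapping the same activations through the transformation, which is one way to read the abstract's affine-invariance claim.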
Submission Number: 63