Context Matters: Analyzing the Generalizability of Linear Probing and Steering Across Diverse Scenarios

Published: 30 Sept 2025, Last Modified: 17 Nov 2025, Mech Interp Workshop (NeurIPS 2025) Poster, CC BY 4.0
Open Source Links: https://github.com/agarwali11/multi-view-capabilities
Keywords: Steering, Probing, Interpretability tooling and software
TL;DR: We find that contextually trained linear probes and steering vectors generalize well to unseen contexts for the same capability.
Abstract: Previous works in mechanistic interpretability have attempted to represent model capabilities with more than a single general direction in the model's activation space; however, many of these works neglect how context shapes capability representations in the latent activation space. We hypothesize that model behaviors like sycophancy or refusal are sets of related directions clustered together by the context they represent. To test this hypothesis, we generate a synthetic dataset covering $5$ different capabilities, each instantiated in $5$ diverse contexts. We use this dataset to train context-specific steering vectors and linear probes and measure their performance on contexts outside their training distribution. We find that contextually trained steering vectors and linear probes recover $95\%$ and $85\%$ accuracy, respectively, on unseen contexts, suggesting that general, context-independent capability representations can be learned and effectively applied in context-specific settings. Our work contributes to a deeper understanding of how capabilities are represented across many contexts in the model's latent activation space and bolsters confidence in applying steering and linear probing techniques in unseen settings that may be critical for safety.
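Below is a minimal sketch of the evaluation setup the abstract describes: train a linear probe and a steering vector on activations from four contexts, then test transfer to a held-out fifth context. The synthetic activations, the difference-of-means construction for the steering vector, and all names below are illustrative assumptions, not the authors' implementation (see the linked repository for that); real activations would come from a model's residual stream.

```python
# Sketch: probe + steering-vector generalization across contexts.
# Assumes a shared "capability" direction plus a small context-specific
# offset, mirroring the hypothesis that behaviors are clusters of
# related directions. All data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, n_per_ctx, n_contexts = 512, 200, 5

capability_dir = rng.normal(size=d_model)
capability_dir /= np.linalg.norm(capability_dir)

def make_context(seed):
    """Return (activations, labels) for one synthetic context."""
    ctx_rng = np.random.default_rng(seed)
    ctx_offset = 0.3 * ctx_rng.normal(size=d_model)  # context-specific shift
    labels = ctx_rng.integers(0, 2, size=n_per_ctx)  # capability on/off
    noise = ctx_rng.normal(size=(n_per_ctx, d_model))
    acts = noise + np.outer(labels, capability_dir + ctx_offset) * 2.0
    return acts, labels

contexts = [make_context(s) for s in range(n_contexts)]
train, (x_test, y_test) = contexts[:-1], contexts[-1]

x_train = np.vstack([x for x, _ in train])
y_train = np.concatenate([y for _, y in train])

# Linear probe: logistic regression on raw activations,
# evaluated on the context it never saw during training.
probe = LogisticRegression(max_iter=1000).fit(x_train, y_train)
print(f"probe accuracy on held-out context: {probe.score(x_test, y_test):.2f}")

# Steering vector: difference of class means over the training contexts.
steer = x_train[y_train == 1].mean(0) - x_train[y_train == 0].mean(0)
# Cosine with the true direction indicates how well it should transfer.
cos = steer @ capability_dir / np.linalg.norm(steer)
print(f"steering-vector cosine with capability direction: {cos:.2f}")
```

Scoring the probe on a context whose offset it never saw mirrors the paper's out-of-distribution test: high held-out accuracy indicates the learned direction is largely context-independent rather than an artifact of any one training context.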
Submission Number: 314