Keywords: Feature Geometry, Concept Discovery (e.g., SAEs, dictionary learning), Benchmarking Interpretability
TL;DR: We evaluate multinomial concept representations in LLMs by modeling them as affine subspaces.
Abstract: The Linear Representation Hypothesis suggests that concepts in language models can be represented as linear directions in activation space. While empirically effective for binary concepts, more recent work suggests that representations of multinomial concepts may exhibit inherently multidimensional structure, such as circular or curved manifold geometries. Efficiently discovering multinomial concept representations and evaluating whether they precisely cover the full concept space remain significant challenges. Building on prior success in identifying non-basis-aligned directions for binary concepts, we model multinomial concepts as affine subspaces, which can be viewed as multidimensional generalizations of directions (optionally with an offset). We introduce methods for locating these affine concept subspaces in language model activation spaces, along with evaluation methods to characterize the precision and recall of recovered multinomial concept spaces. We demonstrate an application where we steer LLMs by sampling points from the recovered subspace; unlike one-dimensional steering, sampling lets us steer a model toward diverse yet concept-related behaviors.
Submission Number: 576
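Illustration: to make the idea of an affine concept subspace concrete, the sketch below represents one as an offset plus an orthonormal basis, fits it to activation vectors with PCA, and samples new points that lie inside it. This is not the method from the paper, which the abstract does not specify; PCA is swapped in as a simple stand-in, and the function names, the Gaussian sampling scheme, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def fit_affine_subspace(X, k):
    """Fit a k-dimensional affine subspace to the rows of X.

    Assumption: plain PCA (mean offset + top-k principal directions),
    used here only as a stand-in for a real discovery method.
    """
    offset = X.mean(axis=0)
    # Rows of Vt are orthonormal principal directions of the centered data.
    _, _, Vt = np.linalg.svd(X - offset, full_matrices=False)
    return offset, Vt[:k]          # offset: (d,), basis: (k, d)

def project(X, offset, basis):
    """Orthogonally project points onto the affine subspace."""
    coords = (X - offset) @ basis.T
    return offset + coords @ basis

def sample_points(X, offset, basis, n, rng):
    """Draw n points inside the subspace (hypothetical sampling scheme).

    Coordinates are Gaussian, scaled per axis to the spread of the data.
    """
    std = ((X - offset) @ basis.T).std(axis=0)
    z = rng.normal(size=(n, basis.shape[0])) * std
    return offset + z @ basis

# Synthetic "activations" lying on a 2-D affine subspace of an 8-D space.
rng = np.random.default_rng(0)
true_basis, _ = np.linalg.qr(rng.normal(size=(8, 2)))
X = rng.normal(size=8) + rng.normal(size=(200, 2)) @ true_basis.T

offset, basis = fit_affine_subspace(X, k=2)
steering_targets = sample_points(X, offset, basis, n=5, rng=rng)
```

In a steering setting, each row of `steering_targets` would serve as a distinct target in activation space, in contrast to a single direction, which offers only one axis to move along.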