Localizing and Evaluating Multinomial Concepts with Affine Subspaces

Divya Appapogu; Freya Behrens; Yonatan Belinkov; Aaron Mueller

Published: 11 Jun 2026, Last Modified: 11 Jun 2026; Mech Interp Workshop ICML 2026 Virtual poster; CC BY 4.0

Keywords: Feature Geometry, Concept Discovery (e.g., SAEs, dictionary learning), Benchmarking Interpretability

TL;DR: We evaluate multinomial concept representations in LLMs by modeling them as affine subspaces.

Abstract: The Linear Representation Hypothesis suggests that concepts in language models can be represented as linear directions in activation space. While empirically effective for binary concepts, more recent work suggests that representations of multinomial concepts may exhibit inherently multidimensional structure, such as circular or curved manifold geometries. Efficiently discovering multinomial concept representations and evaluating whether they precisely cover the full concept space remain significant challenges. Building on prior success in identifying non-basis-aligned directions for binary concepts, we model multinomial concepts as affine subspaces, which can be viewed as multidimensional generalizations of directions (optionally with an offset). We introduce methods for locating these affine concept subspaces in language model activation spaces, along with evaluation methods to characterize the precision and recall of recovered multinomial concept spaces. We demonstrate an application where we steer LLMs by sampling points from the recovered subspace; unlike one-dimensional steering, sampling enables us to steer a model toward diverse but concept-related behaviors.
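
For readers unfamiliar with the term, the following is a minimal formal sketch of what "affine subspace" means in this setting. The symbols (h, W, b, z, k, d) are illustrative notation and do not come from the paper, and the last line shows only one plausible form a sampling-based intervention could take; it is an assumption, not the authors' stated method.

```latex
% Illustrative notation (assumed, not from the paper):
%   h \in \mathbb{R}^d            an activation vector
%   W \in \mathbb{R}^{d \times k}  basis with orthonormal columns (W^\top W = I_k)
%   b \in \mathbb{R}^d            offset
\mathcal{A} = \{\, b + W z \;:\; z \in \mathbb{R}^k \,\}

% Special case k = 1, b = 0: a single linear direction w
\mathcal{A} = \{\, z\,w \;:\; z \in \mathbb{R} \,\}

% One possible intervention (an assumption): replace the coordinates of h
% inside the subspace with sampled coordinates z', leaving the component
% orthogonal to the subspace unchanged
h' = h - W W^\top (h - b) + W z'
```

Under these assumptions, the coordinates of h' within the subspace are W^T(h' - b) = z', so drawing different z' yields different points of the same concept subspace, in contrast to scaling along a single direction.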

Submission Number: 576
