Death by a Thousand Directions: Exploring the Geometry of Harmfulness in LLMs through Subconcept Probing

Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Mech Interp Workshop (NeurIPS 2025) Poster, CC BY 4.0
Keywords: Probing, AI Safety, Steering
Other Keywords: Mechanistic Interpretability, Harmfulness, Linear Probing
TL;DR: We probe 55 harmfulness subconcepts in LLMs, showing they form a low-rank subspace. Steering its dominant direction nearly removes harmfulness with minimal utility loss, offering a scalable lens for model safety.
Abstract: Recent advances in large language models (LLMs) have intensified the need to understand and reliably curb their harmful behaviours. We introduce a multidimensional framework for probing and steering harmful content in model internals. For each of 55 distinct harmfulness subconcepts (e.g., racial hate, employment scams, weapons), we learn a linear probe, yielding 55 interpretable directions in activation space. Collectively, these directions span a harmfulness subspace that we show is strikingly low-rank. We then test ablation of the entire subspace from model internals, as well as steering and ablation along the subspace's dominant direction. We find that steering along the dominant direction nearly eliminates harmfulness with only a small decrease in utility. Our findings advance the emerging view that concept subspaces provide a scalable lens on LLM behaviour and offer practical tools for the community to audit and harden future generations of language models.
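The abstract alone does not specify implementation details, but the recipe it describes (per-subconcept linear probes, a stacked direction matrix whose spectrum reveals a low-rank subspace, and ablation/steering along the dominant direction) can be illustrated with a minimal sketch. The sketch below uses synthetic activations, scikit-learn logistic regression, and names such as n_subconcepts, hidden_dim, ablate, and steer purely as illustrative assumptions, not the paper's actual data, models, or code.

```python
# Minimal sketch of the probing / subspace / steering recipe described above.
# Synthetic data and all hyperparameters are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_subconcepts, hidden_dim, n_samples = 55, 512, 200

# 1. One linear probe per harmfulness subconcept: each probe's weight vector
#    is an interpretable direction in activation space.
directions = []
for _ in range(n_subconcepts):
    acts = rng.normal(size=(n_samples, hidden_dim))    # stand-in activations
    labels = rng.integers(0, 2, size=n_samples)        # stand-in harmful/benign labels
    probe = LogisticRegression(max_iter=1000).fit(acts, labels)
    w = probe.coef_[0]
    directions.append(w / np.linalg.norm(w))           # unit-norm probe direction

W = np.stack(directions)                               # shape (55, hidden_dim)

# 2. Spectrum of the stacked directions: a fast singular-value decay would
#    indicate that the 55 directions span a low-rank harmfulness subspace.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
print("top singular values:", np.round(S[:5], 2))

# 3. Dominant direction = first right-singular vector of W.
v_dom = Vt[0]                                          # shape (hidden_dim,)

def ablate(h, v=v_dom):
    """Remove the component of a hidden state along the dominant direction."""
    return h - (h @ v) * v

def steer(h, alpha, v=v_dom):
    """Shift a hidden state along the dominant direction by strength alpha."""
    return h + alpha * v

h = rng.normal(size=hidden_dim)
print("residual along v_dom after ablation:", float(ablate(h) @ v_dom))
```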
Submission Number: 283