Capturing Multi-Facet Concepts in Language Models: Winograd-Style and Hangman-Style Contrastive Steering Vectors

ACL ARR 2026 January Submission 10242 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: LLMs, Mechanistic Interpretability, Steering Vectors, Multilinguality
Abstract: Mechanistic interpretability aims to open the black box of Large Language Models (LLMs) by uncovering their internal computational mechanisms. While post-hoc methods effectively inspect trained LLMs, showing how components interact and what behaviors emerge, they remain descriptive rather than prescriptive: they reveal what a model does but not why, much like reading assembly code without seeing the high-level logic. To bridge the gap from description to intervention, steering vectors offer a mechanism for manipulating latent representations; however, standard extraction methods that rely on Externally-Anchored Contrasts (EAC), typically multiple-choice or labeled pairs, often capture surface-level task artifacts rather than the target semantic concept. To address this limitation, we present a comprehensive comparative analysis of steering-vector construction methods, contrasting the EAC baseline against Internally-Elicited Contrasts (IEC), a novel approach that uses Winograd-style and Hangman-style (associative) templates. Evaluating Llama-3.1 and Qwen-2.5 on a guilt-attribution (causal) task and a polysemy task, we demonstrate that the efficacy of steering vectors is highly domain-dependent: Winograd-style contrasts excel at capturing causal logic, inducing smooth contextual reinterpretations, while Hangman-style contrasts are superior for isolating intrinsic word senses via sharp lexical overrides. Furthermore, we explore the generalizability of the computed steering vectors, finding that they transfer effectively between closely related languages (e.g., English $\rightarrow$ French) but degrade rapidly across distant language families (e.g., English $\rightarrow$ Chinese). We release our code and datasets to support future work on concept probing and model steering.
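To make the extraction-and-injection pipeline summarized in the abstract concrete, the snippet below is a minimal, hypothetical sketch (not the authors' released code) of the standard contrastive recipe: a steering vector is taken as the mean last-token activation difference between positive and negative prompts at one hidden layer, then added back during generation via a forward hook. The checkpoint name, layer index, steering strength `alpha`, and the example Winograd-style prompt pair are all illustrative assumptions.

```python
# Minimal sketch of contrastive steering-vector extraction and injection.
# Assumptions (not from the paper): the checkpoint, LAYER, and alpha are
# placeholders; the prompts stand in for the paper's Winograd-style templates.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 16  # assumed injection layer; in practice one sweeps over layers

def mean_hidden(prompts):
    """Mean last-token hidden state at LAYER over a set of prompts."""
    states = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # out.hidden_states[LAYER] has shape (1, seq_len, d_model)
        states.append(out.hidden_states[LAYER][0, -1])
    return torch.stack(states).mean(dim=0)

# Contrastive prompt pair eliciting the target concept vs. its foil.
pos = ["The trophy didn't fit in the suitcase because it was too big."]
neg = ["The trophy didn't fit in the suitcase because it was too small."]

steer = mean_hidden(pos) - mean_hidden(neg)
steer = steer / steer.norm()  # unit-normalize before scaling

alpha = 4.0  # assumed steering strength

def hook(_module, _inputs, output):
    # Decoder layers return a tuple; element 0 is the hidden states.
    hidden = output[0] + alpha * steer.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(hook)
ids = tok("The trophy didn't fit in the suitcase because it",
          return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
handle.remove()
```

Under this reading, EAC and IEC differ only in how `pos` and `neg` are constructed (externally labeled pairs versus internally elicited Winograd-/Hangman-style contrasts); the mean-difference and hook-injection machinery stays the same.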
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: contrastive explanations; knowledge tracing; model editing; probing; robustness
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English, French, Chinese
Submission Number: 10242