Mixture-of-Steering Vectors (MoSV): Sparse Gating for Compositional Hallucination Mitigation

Vedant Srinivas; Daniel Lee; Feolu Kolawole

Published: 11 Jun 2026, Last Modified: 11 Jun 2026

Venue: Mech Interp Workshop ICML 2026 Poster

License: CC BY 4.0

Keywords: Methods (probing, steering, causal interventions), Feature Geometry, Applications of interpretability

TL;DR: MoSV replaces a single steering vector with a learned mixture routed per prompt, improving factual accuracy and revealing unsupervised domain structure in activation space.

Abstract: Large language models remain prone to hallucination despite advances in scale and training. Existing inference-time steering methods apply a single global correction vector to every input, treating all hallucinations as one failure mode. We propose Mixture-of-Steering Vectors (MoSV), which discovers multiple hallucination subspaces from contrastive activation data via unsupervised clustering, then trains a lightweight sparse router to select and compose the appropriate vector(s) per prompt at inference time. Evaluated on DefAn (Rahman et al., 2024), a factual QA benchmark spanning eight structured knowledge domains (n = 10,615), MoSV-K8 improves exact-match accuracy from 19.7% (Vanilla) to 22.1% (+2.4 pp, p = 9.1 × 10⁻⁶), while single-vector CAA yields only a negligible, nonsignificant gain (+0.3 pp). A random-routing ablation, which selects vectors without the learned router, degrades accuracy to 17.4% (−2.3 pp relative to Vanilla), indicating that per-prompt routing is the operative mechanism. Analysis shows that K-means clusters of contrastive difference vectors recover ground-truth domain boundaries without any label supervision, which offers an unsupervised account of why compositional steering is effective.
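The pipeline the abstract describes (cluster contrastive difference vectors, take one steering vector per cluster, then sparsely gate and compose them per prompt) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `kmeans`, `sparse_gate`, and `steer`, the top-k softmax gating rule, the scaling factor `alpha`, and the use of cluster centroids as steering vectors are all hypothetical choices, and the router here is an untrained dot-product stand-in for the learned router.

```python
import numpy as np

def kmeans(x, k, iters=50, seed=0):
    """Plain Lloyd's k-means on rows of x; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        # Squared distance from every point to every centroid: shape (n, k).
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids, labels

def sparse_gate(logits, top_k):
    """Softmax over the top_k largest logits; every other weight is zero."""
    idx = np.argsort(logits)[-top_k:]
    weights = np.zeros_like(logits, dtype=float)
    z = np.exp(logits[idx] - logits[idx].max())  # stabilised softmax
    weights[idx] = z / z.sum()
    return weights

def steer(hidden, vectors, router_w, top_k=2, alpha=1.0):
    """Add a gated mixture of steering vectors to one hidden state."""
    logits = router_w @ hidden           # one routing score per vector, shape (K,)
    gates = sparse_gate(logits, top_k)   # sparse mixture weights, sum to 1
    return hidden + alpha * (gates @ vectors), gates

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for activations on truthful vs. hallucinated answers.
    pos_acts = rng.normal(size=(200, 16))
    neg_acts = rng.normal(size=(200, 16))
    diffs = pos_acts - neg_acts          # contrastive difference vectors
    vectors, labels = kmeans(diffs, k=8)  # assumption: centroid = steering vector
    router_w = vectors                   # assumption: untrained placeholder router
    steered, gates = steer(rng.normal(size=16), vectors, router_w, top_k=2)
    print("active vectors:", np.flatnonzero(gates))
```

In this sketch the random-routing ablation would correspond to replacing `logits` with random scores, so the same vectors are composed without reference to the prompt.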

Submission Number: 462
