Keywords: Methods (probing, steering, causal interventions), Feature Geometry, Applications of interpretability
TL;DR: MoSV replaces a single steering vector with a learned mixture routed per prompt, improving factual accuracy and revealing unsupervised domain structure in activation space.
Abstract: Large language models remain prone to hallucination despite advances in scale and training. Existing inference-time steering methods apply a single global correction vector to every input, treating all hallucinations as one monolithic failure mode. We propose Mixture-of-Steering Vectors (MoSV), which discovers multiple hallucination subspaces from contrastive activation data via unsupervised clustering, then trains a lightweight sparse router to select and compose the appropriate vector(s) per prompt at inference time. Evaluated on DefAn (Rahman et al., 2024), a factual QA benchmark spanning eight structured knowledge domains (n=10,615), MoSV-K8 improves exact-match accuracy from 19.7% (Vanilla) to 22.1% (+2.4pp, p=9.1×10⁻⁶), while single-vector CAA yields only a negligible, nonsignificant gain (+0.3pp). A random-routing ablation, which selects vectors without the learned router, degrades accuracy to 17.4% (−2.3pp relative to Vanilla), confirming that per-prompt routing is the operative mechanism. Analysis reveals that K-means clusters of contrastive diff vectors recover ground-truth domain boundaries without any label supervision, providing an unsupervised account of why compositional steering is effective.
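The pipeline described in the abstract (cluster contrastive diff vectors, route each prompt to a sparse subset of the resulting vectors, add the weighted combination to the activation) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names (`kmeans`, `sparse_route`, `steer`), the linear router weights `W`, the top-k softmax gating, and the scale `alpha` are all assumptions made for illustration, and the router is assumed to be already trained.

```python
import numpy as np


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's K-means over contrastive diff vectors X of shape (n, d).

    Returns (centroids, labels); each centroid stands in for one steering vector.
    """
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    return centroids, labels


def sparse_route(h, W, top_k=2):
    """Hypothetical sparse router: linear scores, then softmax over the top_k.

    h: activation of shape (d,); W: router weights of shape (K, d), assumed trained.
    Returns a weight vector of shape (K,) with exactly top_k nonzero entries.
    """
    logits = W @ h
    top = np.argsort(logits)[-top_k:]
    weights = np.zeros(len(logits), dtype=float)
    exp = np.exp(logits[top] - logits[top].max())  # numerically stable softmax
    weights[top] = exp / exp.sum()
    return weights


def steer(h, vectors, weights, alpha=1.0):
    """Add the routed mixture of steering vectors (K, d) to activation h (d,)."""
    return h + alpha * (weights @ vectors)
```

In this sketch, a single-vector baseline corresponds to `K = 1`, and a random-routing ablation would replace `sparse_route` with weights drawn independently of `h`.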
Submission Number: 462