Steering Vector Transfer via Orthonormal Transformations and Semantic Pairing

20 Sept 2025 (modified: 03 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: steering vectors, cross-model transfer, representation alignment, Procrustes analysis, semantic pairing, language models, behavioral control, linear transformations, orthonormal transformations
TL;DR: Evidence for the Platonic Representation Hypothesis: LLMs share behavioral geometry, enabling steering-vector transfer via rotations plus scaling. Semantic pairing is critical: 0.53 cosine similarity with matched pairs vs 0.31 with shuffled pairs and 0.00 with random pairing.
Abstract: Steering vectors—directions in activation space encoding behavioral traits like formality or creativity—enable fine-grained control over language model outputs but must be regenerated for each new model, creating deployment barriers. We present a method for transferring these vectors between different language models by learning structure-preserving orthonormal transformations on semantically matched text pairs. Our approach achieves strong alignment (0.50-0.56 cosine similarity, where 1.0 represents perfect alignment and 0.0 represents random chance) across all model pairs tested. Crucially, we demonstrate that semantic pairing—ensuring each text contrast is matched across models during training—improves transfer performance by 72%: proper pairing achieves 0.529 cosine similarity, compared to 0.308 with shuffled pairs within traits and 0.00 with random pairing across traits. We evaluate our method across 26 behavioral traits on three architecturally distinct models (Gemma-7B, LLaMA-3-8B, and Mistral-7B), using dimensionality reduction to reconcile their different hidden dimensions. Our results provide evidence for the Platonic Representation Hypothesis, showing that different language models encode behavioral preferences in similar geometric structures. This enables practical reuse of curated steering vectors across model families and advances our understanding of how neural networks represent human preferences.
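As a reading aid, here is a minimal sketch of the pipeline the abstract describes, assuming paired per-text activations from both models are already extracted: PCA projects each model into a shared dimensionality, orthogonal Procrustes fits the rotation, and a closed-form uniform scale absorbs magnitude differences. The function names, the k=512 default, and the exact scale formula are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def fit_transfer(acts_src, acts_tgt, k=512):
    """Learn a rotation plus uniform scale from source-model to
    target-model activation space (hypothetical helper).

    acts_src: (n, d_src) activations from the source model
    acts_tgt: (n, d_tgt) activations from the target model; row i of
              each matrix must come from the SAME text (the "semantic
              pairing" the abstract stresses)
    k: shared dimensionality after PCA (assumed choice; requires n >= k)
    """
    # Center, then PCA-project each model into a common k-dim space,
    # reconciling the different hidden sizes (e.g. Gemma vs LLaMA).
    mu_s, mu_t = acts_src.mean(0), acts_tgt.mean(0)
    _, _, Vs = np.linalg.svd(acts_src - mu_s, full_matrices=False)
    _, _, Vt = np.linalg.svd(acts_tgt - mu_t, full_matrices=False)
    Ps, Pt = Vs[:k].T, Vt[:k].T          # (d_src, k), (d_tgt, k)

    Xs = (acts_src - mu_s) @ Ps          # (n, k)
    Xt = (acts_tgt - mu_t) @ Pt          # (n, k)

    # Orthogonal Procrustes: R minimizes ||Xs @ R - Xt||_F with R^T R = I.
    R, _ = orthogonal_procrustes(Xs, Xt)
    # Closed-form uniform scale s minimizing ||s * Xs @ R - Xt||_F.
    s = np.trace(Xt.T @ (Xs @ R)) / np.trace(Xs.T @ Xs)
    return Ps, Pt, R, s

def transfer_vector(v_src, Ps, Pt, R, s):
    """Map a steering vector (d_src,) into the target model's space.
    No centering is needed here: steering vectors are differences of
    activations, so the means cancel."""
    return s * (v_src @ Ps) @ R @ Pt.T
```

Under this reading, the shuffling ablation in the abstract corresponds to permuting the rows of acts_tgt before fitting, which breaks the row-wise correspondence the Procrustes objective relies on.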
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 23461