A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Steering Vectors in Large Language Models
Keywords: Steering Vectors, Activation Engineering
TL;DR: SteerCLR is an unsupervised contrastive method that learns a bank of steering vectors on a frozen LLM (no labels, no finetuning) and uses them at inference to reliably nudge behaviors.
Abstract: Large language models (LLMs) possess impressive generative capabilities but remain opaque and can exhibit unsafe or undesired behaviors. Existing control methods rely on supervised fine-tuning or curated prompt-response datasets, which limits their scalability. We propose SteerCLR, an unsupervised method that simultaneously discovers a bank of diverse and disentangled steering vectors directly from unlabeled prompts. By optimizing a novel contrastive objective over internal model activations, SteerCLR learns vectors that correspond to distinct behavioral shifts. Injecting these vectors into a frozen LLM enables fine-grained, low-latency control over generation, including suppressing toxicity, modulating sentiment, and uncovering subtle stylistic dimensions, without relying on labeled data, classifiers, or attribute-specific supervision. We demonstrate that jointly optimizing for activation magnitude and activation diversity yields a rich set of interpretable directions. Experiments on the instruction-tuned Llama-2-13B-chat model show that SteerCLR discovers diverse, interpretable steering vectors in a single training run, advancing the scalability of mechanistic interpretability and enabling practical interventions for safety, alignment, and model auditing.
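The inference-time intervention the abstract describes, adding a learned steering vector to a frozen model's internal activations, can be sketched with a PyTorch forward hook. This is an illustrative toy, not the authors' implementation: the `nn.Linear` stands in for one transformer block of the frozen LLM, and the steering vector and scale `alpha` are hypothetical placeholders for a vector drawn from SteerCLR's learned bank.

```python
import torch
import torch.nn as nn

# Toy stand-in for one transformer block's output; in practice the hook
# would be registered on a hidden layer of the frozen LLM (e.g. Llama-2-13B-chat).
torch.manual_seed(0)
hidden_dim = 16
block = nn.Linear(hidden_dim, hidden_dim)

# A unit-norm steering direction (random here for illustration) and a
# strength coefficient; SteerCLR would supply one vector from its bank.
steering_vector = torch.randn(hidden_dim)
steering_vector = steering_vector / steering_vector.norm()
alpha = 4.0

def steer_hook(module, inputs, output):
    # Shift every token's hidden state along the steering direction.
    return output + alpha * steering_vector

handle = block.register_forward_hook(steer_hook)

x = torch.randn(2, 5, hidden_dim)  # (batch, seq_len, hidden_dim)
steered = block(x)                 # steered forward pass
handle.remove()
baseline = block(x)                # unmodified forward pass

# The intervention shifts activations by exactly alpha * steering_vector,
# leaving the model's weights untouched.
shift = (steered - baseline).mean(dim=(0, 1))
```

Because the edit is a hook rather than a weight update, it adds negligible latency and can be toggled or rescaled per request, which is what makes this family of interventions attractive for inference-time control.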
Supplementary Material: zip
Primary Area: generative models
Submission Number: 22318