A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Steering Vectors in Large Language Models
Keywords: Steering Vectors, Activation Engineering
TL;DR: SteerCLR is an unsupervised contrastive method that learns a bank of steering vectors on a frozen LLM (no labels, no finetuning) and uses them at inference to reliably nudge behaviors.
Abstract: Large language models (LLMs) possess impressive generative capabilities but remain opaque and can exhibit unsafe or undesired behaviors. Existing control methods rely on supervised fine-tuning or curated prompt-response datasets, which limits their scalability. We propose SteerCLR, an unsupervised method that simultaneously discovers a bank of diverse and disentangled steering vectors directly from unlabeled prompts. By optimizing a novel contrastive objective over internal model activations, SteerCLR learns vectors that correspond to distinct behavioral shifts. Injecting these vectors into a frozen LLM enables fine-grained, low-latency control over generation, including suppressing toxicity, modulating sentiment, and uncovering subtle stylistic dimensions, without relying on labeled data, classifiers, or attribute-specific supervision. We demonstrate that jointly optimizing for activation magnitude and activation diversity yields a rich set of interpretable directions. Experiments on the instruction-tuned Llama-2-13B-chat model show that SteerCLR discovers diverse, interpretable steering vectors in a single training run, advancing the scalability of mechanistic interpretability and enabling practical interventions for safety, alignment, and model auditing.
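The inference-time intervention the abstract describes, adding a learned steering vector to a frozen model's internal activations, can be sketched with a PyTorch forward hook. This is an illustrative toy, not the authors' implementation: the `nn.Linear` stands in for one transformer block of the frozen LLM, and the steering vector and scale `alpha` are hypothetical placeholders for a vector drawn from SteerCLR's learned bank.

```python
import torch
import torch.nn as nn

# Toy stand-in for one transformer block's output; in practice the hook
# would be registered on a hidden layer of the frozen LLM (e.g. Llama-2-13B-chat).
torch.manual_seed(0)
hidden_dim = 16
block = nn.Linear(hidden_dim, hidden_dim)

# A unit-norm steering direction (random here for illustration) and a
# strength coefficient; SteerCLR would supply one vector from its bank.
steering_vector = torch.randn(hidden_dim)
steering_vector = steering_vector / steering_vector.norm()
alpha = 4.0

def steer_hook(module, inputs, output):
    # Shift every token's hidden state along the steering direction.
    return output + alpha * steering_vector

handle = block.register_forward_hook(steer_hook)

x = torch.randn(2, 5, hidden_dim)  # (batch, seq_len, hidden_dim)
steered = block(x)                 # steered forward pass
handle.remove()
baseline = block(x)                # unmodified forward pass

# The intervention shifts activations by exactly alpha * steering_vector,
# leaving the model's weights untouched.
shift = (steered - baseline).mean(dim=(0, 1))
```

Because the edit is a hook rather than a weight update, it adds negligible latency and can be toggled or rescaled per request, which is what makes this family of interventions attractive for inference-time control.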
Supplementary Material: zip
Primary Area: generative models
Submission Number: 22318