YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation

ICLR 2026 Conference Submission19998 Authors

19 Sept 2025 (modified: 08 Oct 2025) | ICLR 2026 Conference Submission | CC BY 4.0

Keywords: Alignment, Preference Optimization, Activation Steering, Sparse Autoencoders, Domain Adaptation, Large Language Models

TL;DR: We introduce YaPO, a reference-free method that learns sparse steering vectors via SAEs, enabling efficient, stable, and fine-grained alignment of large language models.

Abstract: Steering large language models (LLMs) through activation interventions has emerged as a lightweight alternative to fine-tuning for alignment and personalization. Recent work on Bi-directional Preference Optimization (BiPO) shows that dense steering vectors can be learned directly from preference data, in the style of Direct Preference Optimization (DPO), enabling control over truthfulness, hallucinations, and safety behaviors. However, dense steering vectors often entangle multiple latent factors due to neuron multi-semanticity, which limits their effectiveness and stability in fine-grained settings such as cultural alignment, where closely related values and behaviors (e.g., among Middle Eastern cultures) must be distinguished. In this paper, we propose $\textbf{Yet Another Policy Optimization (YaPO)}$, a $\textbf{reference-free}$ method that learns $\textbf{sparse steering vectors}$ in the latent space of a $\textbf{Sparse Autoencoder (SAE)}$. By optimizing sparse codes, YaPO produces disentangled, interpretable, and efficient steering directions. Empirically, we show that sparse steering vectors converge faster, achieve lower training and evaluation loss, and remain more stable throughout training than their dense counterparts. Beyond cultural alignment, YaPO generalizes to the diverse alignment-related behaviors studied in BiPO, including truthfulness, hallucination mitigation, and jailbreak defense. Our results demonstrate that YaPO sparse steering provides a general recipe for efficient, stable, and fine-grained alignment of LLMs, with broad implications for controllability and domain adaptation.
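
To make the idea of steering in an SAE latent space concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the exact intervention or training objective, so the toy SAE with random weights, the residual-preserving update, and all names (`make_toy_sae`, `steer`, `v_sparse`, `scale`) are assumptions made for illustration. In YaPO the sparse vector would be learned from preference data; here it is set by hand.

```python
import random


def relu(xs):
    return [max(0.0, x) for x in xs]


def matvec(M, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def make_toy_sae(d_model, d_latent, seed=0):
    # Hypothetical stand-in for a pretrained SAE: random encoder/decoder weights.
    rng = random.Random(seed)
    W_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_latent)]
    W_dec = [[rng.gauss(0, 0.1) for _ in range(d_latent)] for _ in range(d_model)]
    return W_enc, W_dec


def steer(h, v_sparse, W_enc, W_dec, scale=1.0):
    # Encode the activation into the SAE latent space.
    z = relu(matvec(W_enc, h))
    # Keep whatever the SAE fails to reconstruct, so a zero vector is a no-op.
    recon = matvec(W_dec, z)
    residual = [a - b for a, b in zip(h, recon)]
    # Shift the latent code by the sparse steering vector, then decode.
    z_steered = relu([zi + scale * vi for zi, vi in zip(z, v_sparse)])
    return [a + b for a, b in zip(matvec(W_dec, z_steered), residual)]


d_model, d_latent = 8, 32
W_enc, W_dec = make_toy_sae(d_model, d_latent)
rng = random.Random(1)
h = [rng.gauss(0, 1) for _ in range(d_model)]

# A steering vector with a single active latent feature (index 3).
v_sparse = [0.0] * d_latent
v_sparse[3] = 5.0

h_steered = steer(h, v_sparse, W_enc, W_dec)
delta = [a - b for a, b in zip(h_steered, h)]
```

In this sketch, activating one latent feature shifts the activation along that feature's decoder column only, which is one way to read the abstract's claim that sparse codes give more disentangled steering directions than a dense vector.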

Primary Area: alignment, fairness, safety, privacy, and societal considerations

Submission Number: 19998
