Keywords: Alignment, Preference Optimization, Activation Steering, Sparse Autoencoders, Domain Adaptation, Large Language Models
TL;DR: We introduce YaPO, a reference-free method that learns sparse steering vectors via SAEs, enabling efficient, stable, and fine-grained alignment of large language models.
Abstract: Steering large language models (LLMs) through activation interventions has emerged as a lightweight alternative to fine-tuning for alignment and personalization. Recent work on Bi-directional Preference Optimization (BiPO) shows that dense steering vectors can be learned directly from preference data, in the style of Direct Preference Optimization (DPO), enabling control over truthfulness, hallucinations, and safety behaviors. However, dense steering vectors often entangle multiple latent factors due to neuron multi-semanticity, which limits their effectiveness and stability in fine-grained settings such as cultural alignment, where closely related values and behaviors (e.g., among Middle Eastern cultures) must be distinguished.
In this paper, we propose $\textbf{Yet Another Policy Optimization (YaPO)}$, a $\textbf{reference-free}$ method that learns $\textbf{sparse steering vectors}$ in the latent space of a $\textbf{Sparse Autoencoder (SAE)}$.
By optimizing sparse codes, YaPO produces disentangled, interpretable, and efficient steering directions.
Empirically, we show that sparse steering vectors converge faster, achieve lower training and evaluation loss, and remain more stable throughout training than their dense counterparts.
Beyond cultural alignment, YaPO generalizes to diverse alignment-related behaviors studied in BiPO, including truthfulness, hallucination mitigation, and jailbreak defense.
Our results demonstrate that YaPO sparse steering provides a general recipe for efficient, stable, and fine-grained alignment of LLMs, with broad implications for controllability and domain adaptation.
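To make the idea of steering in an SAE's latent space concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a ReLU SAE with randomly initialized weights, a hand-picked sparse vector `v` in place of a learned one, and a residual term that adds back the SAE reconstruction error. The function names, shapes, and the residual correction are all illustrative assumptions that the abstract does not specify.

```python
import numpy as np

def sae_encode(h, W_enc, b_enc):
    # ReLU encoder (assumed SAE form): hidden state -> non-negative sparse code
    return np.maximum(h @ W_enc + b_enc, 0.0)

def sae_decode(z, W_dec, b_dec):
    # Linear decoder: sparse code -> hidden-state space
    return z @ W_dec + b_dec

def steer(h, v, W_enc, b_enc, W_dec, b_dec, scale=1.0):
    # Hypothetical steering step: shift the sparse code by `scale * v`,
    # decode, and add back the part of `h` the SAE failed to reconstruct.
    z = sae_encode(h, W_enc, b_enc)
    residual = h - sae_decode(z, W_dec, b_dec)
    z_steered = np.maximum(z + scale * v, 0.0)  # keep the code non-negative
    return sae_decode(z_steered, W_dec, b_dec) + residual

# Toy dimensions and random weights, for illustration only
rng = np.random.default_rng(0)
d_model, d_sae = 8, 32
W_enc = rng.normal(size=(d_model, d_sae)) * 0.1
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model)) * 0.1
b_dec = np.zeros(d_model)

# A sparse steering vector: only two latent features are touched
v = np.zeros(d_sae)
v[[3, 17]] = [2.0, -1.5]

h = rng.normal(size=(4, d_model))  # a batch of 4 hidden states
h_zero = steer(h, np.zeros(d_sae), W_enc, b_enc, W_dec, b_dec)
h_steered = steer(h, v, W_enc, b_enc, W_dec, b_dec)
```

Under these assumptions, a zero steering vector leaves the hidden states unchanged, and a vector with few nonzero entries modifies only the decoder directions of the selected features, which is one way to read the disentanglement claim above.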
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 19998