Keywords: Reinforcement Learning, Mechanistic Interpretability, Large Language Models, Steering
TL;DR: We created a transparent AI alignment tool and discovered that models learn to satisfy human preferences by making their outputs more stylish and better formatted, rather than by drawing on concepts like 'honesty' or 'safety'
Abstract: Prevailing alignment methods induce opaque parameter changes, making it difficult to audit what the model truly learns. To address this, we introduce Feature Steering with Reinforcement Learning (FSRL), a framework that trains a lightweight adapter to steer model behavior by modulating interpretable sparse features. First, we theoretically show that this mechanism is principled and expressive enough to approximate the behavioral shifts of post-training processes. Then, we apply this framework to the task of preference optimization and perform a causal analysis of the learned policy. We find that the model relies on stylistic presentation as a proxy for quality, disproportionately steering features related to style and formatting over those tied to alignment concepts like honesty. Despite exploiting this heuristic, FSRL proves to be an effective alignment method, achieving a substantial reduction in preference loss. Overall, FSRL offers an interpretable control interface and a practical way to diagnose how preference optimization pressures manifest at the feature level.
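To make the steering mechanism described in the abstract concrete, here is a minimal sketch of what a lightweight adapter acting on interpretable sparse features might look like. It is illustrative only: the class name `FSRLSteeringAdapter`, the single-linear-layer architecture, the ReLU non-negativity constraint, and the use of a frozen sparse-autoencoder decoder are all assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class FSRLSteeringAdapter(nn.Module):
    """Hypothetical sketch: map a residual-stream activation to steering
    coefficients over sparse (SAE) features, decode them back into model
    space, and add the result to the residual stream."""

    def __init__(self, d_model: int, n_features: int, sae_decoder: torch.Tensor):
        super().__init__()
        # sae_decoder: (n_features, d_model) dictionary of feature directions,
        # assumed to come from a pretrained sparse autoencoder and kept frozen.
        self.register_buffer("decoder", sae_decoder)
        # The "lightweight adapter": a single linear map from hidden state
        # to per-feature steering coefficients.
        self.to_coeffs = nn.Linear(d_model, n_features)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model) residual-stream activations.
        coeffs = torch.relu(self.to_coeffs(hidden))   # non-negative feature modulations
        steering = coeffs @ self.decoder              # project back into model space
        return hidden + steering                      # steered residual stream
```

In such a setup, only the adapter's parameters would be trained (e.g. with a preference-optimization objective), while the base model and the sparse-feature dictionary stay fixed, which is what makes the learned steering coefficients directly inspectable.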
Primary Area: interpretability and explainable AI
Submission Number: 14414