Keywords: Reinforcement Learning, Mechanistic Interpretability, Large Language Models, Steering
TL;DR: We created a transparent AI alignment tool and discovered that models learn to satisfy human preferences by making their outputs more stylish and better formatted, rather than by drawing on concepts like 'honesty' or 'safety'
Abstract: Prevailing alignment methods induce opaque parameter changes, making it difficult to audit what the model truly learns. To address this, we introduce Feature Steering with Reinforcement Learning (FSRL), a framework that trains a lightweight adapter to steer model behavior by modulating interpretable sparse features. First, we theoretically show that this mechanism is principled and expressive enough to approximate the behavioral shifts of post-training processes. Then, we apply this framework to the task of preference optimization and perform a causal analysis of the learned policy. We find that the model relies on stylistic presentation as a proxy for quality, disproportionately steering features related to style and formatting over those tied to alignment concepts like honesty. Despite exploiting this heuristic, FSRL proves to be an effective alignment method, achieving a substantial reduction in preference loss. Overall, FSRL offers an interpretable control interface and a practical way to diagnose how preference optimization pressures manifest at the feature level.
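To make the steering mechanism described in the abstract concrete, here is a minimal sketch of what a lightweight adapter acting on interpretable sparse features might look like. It is illustrative only: the class name `FSRLSteeringAdapter`, the single-linear-layer architecture, the ReLU non-negativity constraint, and the use of a frozen sparse-autoencoder decoder are all assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class FSRLSteeringAdapter(nn.Module):
    """Hypothetical sketch: map a residual-stream activation to steering
    coefficients over sparse (SAE) features, decode them back into model
    space, and add the result to the residual stream."""

    def __init__(self, d_model: int, n_features: int, sae_decoder: torch.Tensor):
        super().__init__()
        # sae_decoder: (n_features, d_model) dictionary of feature directions,
        # assumed to come from a pretrained sparse autoencoder and kept frozen.
        self.register_buffer("decoder", sae_decoder)
        # The "lightweight adapter": a single linear map from hidden state
        # to per-feature steering coefficients.
        self.to_coeffs = nn.Linear(d_model, n_features)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model) residual-stream activations.
        coeffs = torch.relu(self.to_coeffs(hidden))   # non-negative feature modulations
        steering = coeffs @ self.decoder              # project back into model space
        return hidden + steering                      # steered residual stream
```

In such a setup, only the adapter's parameters would be trained (e.g. with a preference-optimization objective), while the base model and the sparse-feature dictionary stay fixed, which is what makes the learned steering coefficients directly inspectable.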
Primary Area: interpretability and explainable AI
Submission Number: 14414