Open Source Links: github.com/Jazhyc/feature-steering-RL
Keywords: Sparse Autoencoders, Reinforcement Learning, Steering
Other Keywords: Applications of Interpretability, Automated Interpretability
TL;DR: We created a transparent AI alignment tool and found that models learn to satisfy human preferences largely by making their outputs more stylish and well-formatted, rather than by drawing on concepts like 'honesty' or 'safety'.
Abstract: Prevailing alignment methods induce opaque parameter changes, obscuring what models truly learn. To address this, we introduce Feature Steering with Reinforcement Learning (FSRL), a framework that trains a lightweight adapter to steer model behavior by modulating interpretable sparse features. First, we theoretically demonstrate that this mechanism is expressive enough to approximate the behavioral shifts of post-training processes. We then apply FSRL to preference optimization and perform a causal analysis of the learned policy. Our analysis reveals a crucial insight: the learned policy treats stylistic presentation as a proxy for quality, disproportionately relying on features related to style and formatting over those tied to alignment concepts like honesty. By effectively optimizing the preference objective, FSRL serves as a transparent proxy for observing the alignment process. Overall, FSRL offers an interpretable control interface and a practical way to diagnose how preference optimization pressures manifest at the feature level.
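To make the abstract's mechanism concrete, here is a minimal sketch (not the authors' implementation; see the linked repository for the actual code) of what a lightweight feature-steering adapter could look like: a small module that maps a hidden state to per-feature coefficients, which scale interpretable SAE decoder directions added back into the residual stream. All names (`SteeringAdapter`, `d_model`, `n_features`) and dimensions are illustrative assumptions.

```python
# Hypothetical sketch of a feature-steering adapter over SAE features.
import torch
import torch.nn as nn

class SteeringAdapter(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        # Lightweight linear map: hidden state -> one steering coefficient per SAE feature.
        self.coeffs = nn.Linear(d_model, n_features)

    def forward(self, hidden: torch.Tensor, sae_decoder: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model); sae_decoder: (n_features, d_model)
        alpha = self.coeffs(hidden)        # per-token, per-feature steering strengths
        steering = alpha @ sae_decoder     # combine interpretable feature directions
        return hidden + steering           # steered residual stream

# Usage idea: insert at one transformer layer, freeze the base model and SAE,
# and train only the adapter with a preference-optimization objective (e.g., a
# DPO- or PPO-style loss), so the steering coefficients expose what the
# alignment pressure actually modulates.
adapter = SteeringAdapter(d_model=768, n_features=16384)
```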
Submission Number: 159