Open Source Links: github.com/Jazhyc/feature-steering-RL
Keywords: Sparse Autoencoders, Reinforcement Learning, Steering
Other Keywords: Applications of Interpretability, Automated Interpretability
TL;DR: We built a transparent AI alignment tool and found that models learn to satisfy human preferences by making their outputs more stylish and well-formatted, rather than by drawing more heavily on concepts like 'honesty' or 'safety'.
Abstract: Prevailing alignment methods induce opaque parameter changes, making it difficult to audit what the model truly learns. To address this, we introduce Feature Steering with Reinforcement Learning (FSRL), a framework that trains a lightweight adapter to steer model behavior by modulating interpretable sparse features. First, we theoretically show that this mechanism is principled and expressive enough to approximate the behavioral shifts of post-training processes. Then, we apply this framework to the task of preference optimization and perform a causal analysis of the learned policy. We find that the model relies on stylistic presentation as a proxy for quality, disproportionately steering features related to style and formatting over those tied to alignment concepts like honesty. Despite exploiting this heuristic, FSRL proves to be an effective alignment method, achieving a substantial reduction in preference loss. Overall, FSRL offers an interpretable control interface and a practical way to diagnose how preference optimization pressures manifest at the feature level.
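The abstract describes steering the model by modulating interpretable sparse features through a lightweight adapter. The sketch below illustrates one plausible shape of that mechanism, assuming a pretrained sparse autoencoder (SAE) attached to the residual stream; the class and function names (`SteeringAdapter`, `steer_residual`, `sae_encode`, `sae_decode`) and dimensions are hypothetical and not taken from the released FSRL code.

```python
# Minimal sketch (an assumption, not the released FSRL implementation) of a
# lightweight adapter that steers a model by modulating SAE features.
import torch
import torch.nn as nn


class SteeringAdapter(nn.Module):
    """Predicts a steering coefficient per SAE feature from the residual stream."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        # A single linear map keeps the adapter small relative to the base model.
        self.proj = nn.Linear(d_model, d_sae)

    def forward(self, resid: torch.Tensor) -> torch.Tensor:
        # One modulation coefficient per interpretable sparse feature.
        return self.proj(resid)


def steer_residual(resid, sae_encode, sae_decode, adapter):
    """Add a steering vector built from modulated SAE features to the residual stream.

    sae_encode / sae_decode are assumed wrappers around a frozen, pretrained SAE;
    the adapter rescales feature activations before they are decoded back.
    """
    feats = sae_encode(resid)                       # interpretable sparse features
    coeffs = adapter(resid)                         # learned per-feature modulation
    steering = sae_decode(coeffs * feats) - sae_decode(feats)
    return resid + steering                         # steered activations flow onward
```

In this reading, only the adapter's parameters would be trained (e.g., against a preference-optimization objective), so the resulting policy can be audited directly in terms of which named SAE features it amplifies or suppresses.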
Submission Number: 159