Control Reinforcement Learning: Interpretable Token-Level Steering of LLMs via Sparse Autoencoder Features

ICLR 2026 Conference Submission 12663 Authors

18 Sept 2025 (modified: 03 Dec 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Mechanistic Interpretability, AI Steering, Reinforcement Learning, RL, AI Control, PPO, Representation Learning, Sparse Autoencoder
Abstract: Large language models can exhibit emergent misalignment during test-time generation, necessitating dynamic control mechanisms for safe deployment. Sparse autoencoders (SAEs) disentangle monosemantic features from dense activations in superposition, offering a natural interface for controlling language model behavior through interpretable feature manipulation. This work introduces Control Reinforcement Learning (CRL), a framework that unifies reinforcement learning with SAE features for interpretable token-level control of language models. CRL enables interpretable branch tracking by isolating feature contributions at each generation step, revealing which features drive behavior changes. To balance exploration and exploitation, the framework employs Adaptive Feature Masking (AFM), which encourages diverse yet effective feature discovery while maintaining interpretability. Through token-wise feature analysis, CRL provides mechanistic insight into model behavior, revealing task-specific feature contributions across diverse benchmarks including question answering, bias mitigation, reasoning, and safety tasks. The framework is compatible with supervised fine-tuning, providing complementary control when applied to SFT models. Results demonstrate that interpretable steering can serve as both a control method and an analysis tool, establishing a practical pathway for controllable AI systems.
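To make the SAE-based control interface described above concrete, the following is a minimal illustrative sketch (not the paper's CRL implementation, which is not given in the abstract): it encodes a dense activation into sparse SAE features, rescales one chosen feature, and decodes back to a steered activation. The SparseAutoencoder class, the steer_activation helper, and the feature_idx and scale parameters are hypothetical names introduced here for illustration; in a CRL-style setup an RL policy would choose the feature edits per token rather than using fixed constants.

```python
# Hypothetical sketch of SAE-feature steering; not the paper's implementation.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Minimal SAE: dense activation -> sparse non-negative features -> reconstruction."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU keeps feature activations non-negative and sparse.
        return torch.relu(self.encoder(x))

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return self.decoder(f)


def steer_activation(sae: SparseAutoencoder,
                     hidden: torch.Tensor,
                     feature_idx: int,
                     scale: float) -> torch.Tensor:
    """Scale one interpretable SAE feature and return the edited dense activation.

    `feature_idx` and `scale` are illustrative knobs; a token-level controller
    would select them at each generation step.
    """
    features = sae.encode(hidden)
    # Keep the SAE reconstruction error so directions the SAE does not capture
    # are preserved in the edited activation.
    reconstruction_error = hidden - sae.decode(features)
    features[..., feature_idx] *= scale  # amplify (>1) or suppress (<1) the feature
    return sae.decode(features) + reconstruction_error


if __name__ == "__main__":
    # Toy usage: one residual-stream activation of width 16, 64 SAE features.
    torch.manual_seed(0)
    sae = SparseAutoencoder(d_model=16, d_features=64)
    hidden = torch.randn(1, 16)  # stand-in for a single token's activation
    steered = steer_activation(sae, hidden, feature_idx=3, scale=2.0)
    print("edit norm:", (steered - hidden).norm().item())
```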
Primary Area: interpretability and explainable AI
Submission Number: 12663