Control Reinforcement Learning: Interpretable Token-Level Steering of LLMs via Sparse Autoencoder Features

ICLR 2026 Conference Submission 12663 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Mechanistic Interpretability, AI Steering, Reinforcement Learning, RL, AI Control, PPO, Representation Learning, Sparse Autoencoder
Abstract: Large language models exhibit emergent misalignment behaviors during test-time generation, necessitating dynamic control mechanisms for safe deployment. Sparse autoencoders (SAEs) can disentangle monosemantic features from superposed dense activations, offering a natural interface for controlling language model behavior through interpretable feature manipulation. This work introduces Control Reinforcement Learning (CRL), a framework that trains policy networks to dynamically select task-relevant SAE features through reward-based feedback. CRL enables interpretable performance tracking by isolating feature contributions at each generation step, revealing which features drive improvements across diverse benchmarks including question answering, bias mitigation, and reasoning tasks. To balance exploration and exploitation, the framework employs Adaptive Feature Masking (AFM), which encourages diverse yet effective feature discovery while maintaining interpretability. Through token-wise feature analysis, CRL provides mechanistic insights into model behavior while achieving modest performance improvements on Gemma-2 2B at task-optimal layers: MMLU (+3.29%), GSM8K (+1.14%), BBQ bias mitigation (+3.55%), and HarmBench safety (+5.61%). These results demonstrate that interpretable steering can serve as both a performance enhancement and an analysis tool, establishing a practical pathway toward controllable AI systems.
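
To make the abstract's mechanism concrete, the sketch below shows what a single token-level steering step could look like: an SAE encodes a residual-stream activation into sparse features, a small policy network scores the features, and an adaptive mask down-weights recently selected features before the masked features are decoded back into the residual stream. This is a minimal sketch under assumed details; the class names, dimensions, top-k selection rule, and usage-count penalty are illustrative stand-ins, not the submission's actual CRL or AFM implementation.

```python
# Illustrative sketch of token-level SAE feature steering with a policy and an
# adaptive feature mask. All names, shapes, and the selection rule are assumptions.
import torch
import torch.nn as nn

D_MODEL, D_SAE, TOP_K = 256, 4096, 8  # toy sizes, not Gemma-2 2B dimensions


class SparseAutoencoder(nn.Module):
    """Encodes a dense activation into sparse features and decodes back."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.enc(x))

    def decode(self, feats: torch.Tensor) -> torch.Tensor:
        return self.dec(feats)


class FeaturePolicy(nn.Module):
    """Scores SAE features from the current residual-stream activation."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.scorer = nn.Linear(d_model, d_sae)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scorer(x)  # one logit per SAE feature


def steer_step(x, sae, policy, usage_counts, penalty: float = 0.1):
    """One token-level steering step with a simple adaptive feature mask.

    Features selected in earlier steps are penalized via `usage_counts`,
    a crude stand-in for the exploration/exploitation balance AFM provides.
    """
    feats = sae.encode(x)                                   # (1, D_SAE) sparse features
    logits = policy(x) - penalty * usage_counts             # down-weight overused features
    top = torch.topk(logits, TOP_K, dim=-1).indices         # pick K features to keep
    mask = torch.zeros_like(feats).scatter_(-1, top, 1.0)   # binary keep-mask over features
    usage_counts.scatter_add_(0, top.squeeze(0), torch.ones(TOP_K))
    return sae.decode(feats * mask), top                    # steered activation + chosen features


if __name__ == "__main__":
    sae = SparseAutoencoder(D_MODEL, D_SAE)
    policy = FeaturePolicy(D_MODEL, D_SAE)
    usage = torch.zeros(D_SAE)
    x = torch.randn(1, D_MODEL)          # stand-in for one token's residual-stream activation
    steered, selected = steer_step(x, sae, policy, usage)
    print(steered.shape, selected.tolist())
```

In a full CRL-style loop, the selected feature indices would form the policy's action, a downstream task reward (e.g. answer correctness or a safety score) would drive a PPO-style update of the policy, and the per-token selections would be logged for the kind of token-wise feature analysis the abstract describes.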
Primary Area: interpretability and explainable AI
Submission Number: 12663