Keywords: Chain of Thought/Reasoning models, Steering, Probing
TL;DR: We propose ReflCtrl, a representation engineering framework that identifies and steers a reflection direction in LLMs.
Abstract: Large language models (LLMs) with Chain-of-Thought (CoT) reasoning have achieved strong performance across diverse tasks, including mathematics, coding, and general reasoning. A distinctive ability of these reasoning models is **self-reflection**: the ability to review and revise previous reasoning steps. While self-reflection enhances reasoning performance, it also increases inference cost. In this work, we study self-reflection through the lens of **representation engineering**. We segment the model's reasoning into steps, identify the steps corresponding to reflection, and extract a reflection direction in the latent space that governs this behavior. Using this direction, we propose a stepwise steering method that controls reflection frequency. We call our framework ReflCtrl. Our experiments show that (1) in many cases reflections are redundant, especially in stronger models; in our experiments, steering reduces inference cost by up to 33.6\% while preserving performance; and (2) the model's reflection behavior is highly correlated with an internal uncertainty signal, suggesting that self-reflection may be governed by the model's uncertainty.
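To make the representation-engineering idea concrete, below is a minimal sketch (not the authors' code) of how such a reflection direction could be extracted and applied. It assumes the common difference-of-means recipe over per-step hidden states and a HuggingFace-style decoder; the names `reflection_direction`, `make_steering_hook`, `LAYER`, and `alpha` are illustrative assumptions, not identifiers from the paper.

```python
# Sketch: extract a "reflection direction" as the difference of mean
# activations between reflection and non-reflection reasoning steps, then
# shift the residual stream along that direction during generation.
import torch

def reflection_direction(refl_acts: torch.Tensor, other_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction between step groups.

    refl_acts / other_acts: (n_steps, hidden_dim) mean hidden state per step.
    """
    d = refl_acts.mean(dim=0) - other_acts.mean(dim=0)
    return d / d.norm()  # unit-normalize so the steering scale is set by alpha

def make_steering_hook(direction: torch.Tensor, alpha: float):
    """Forward hook adding alpha * direction to a layer's hidden states.

    alpha > 0 would encourage reflection; alpha < 0 would suppress it.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(device=hidden.device, dtype=hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage with a HuggingFace-style decoder exposing model.model.layers:
# direction = reflection_direction(refl_acts, other_acts)
# handle = model.model.layers[LAYER].register_forward_hook(
#     make_steering_hook(direction, alpha=-4.0))  # negative alpha: fewer reflections
# outputs = model.generate(**inputs)
# handle.remove()
```

Applying the hook stepwise, i.e., only while decoding selected reasoning steps, rather than for the whole generation, would match the stepwise control described in the abstract.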
Submission Number: 82