Keywords: reasoning, mechanistic interpretability, reinforcement learning, Group Relative Policy Optimization (GRPO), RLHF, small language models, interpretability, sparse autoencoders (SAEs), feature circuits, representation dynamics, symmetry generalization, world models, games as benchmarks, OOD robustness, chain-of-thought, evaluation
TL;DR: Our study shows RLHF with GRPO improves LLM accuracy, but SAE probes reveal reliance on surface patterns. We map representation shifts across training and introduce a hypothesis-driven interpretability pipeline.
Abstract: Large language models (LLMs) are increasingly described as acquiring "reasoning" skills after reinforcement learning from human feedback (RLHF) or related alignment methods. Benchmark improvements are widely celebrated as progress toward higher-order reasoning. However, whether these gains reflect genuine structural reasoning or more superficial adaptations remains underexplored. In this work, we probe LLMs trained in a finite and exhaustively analyzable logical domain, namely **Tic-Tac-Toe**, and trace how internal representations evolve across reinforcement learning with Group Relative Policy Optimization (GRPO). Quantitatively, reinforcement learning improves model performance far more than supervised fine-tuning (SFT), yielding higher accuracy and robustness across prompt variations. Mechanistic interpretability, however, paints a different picture: features extracted with sparse autoencoders (SAEs) reveal that models primarily adapt to better extract and exploit information already explicit in the prompt, such as whose turn it is, game progression, and board occupancy.
By contrast, high-level concepts like board symmetries, strategic forks, and guaranteed wins remain weakly represented, echoing concerns that reasoning benchmarks risk overstating abstraction. This tension between surface-level performance and deeper representational change suggests that claims of RLHF-driven "reasoning" may conflate task-specific updates with structural reasoning ability. Our contribution is threefold: (i) a systematic interpretability pipeline **tracing representation dynamics for the first time** across RL training in LLMs; (ii) an extension of SAE-based feature discovery to hypothesis-driven testing in a finite logical domain; and (iii) **the first interpretability-based demonstration** that reinforcement learning amplifies prompt-level feature use rather than developing higher-order (game) reasoning. These findings argue for interpretability-first evaluation of reasoning claims, aligning with broader calls to ground reasoning in mechanistic analysis.
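To make the hypothesis-driven SAE probing described in the abstract concrete, here is a minimal sketch (not the authors' actual pipeline): it trains a one-layer sparse autoencoder on residual-stream activations and scores each learned feature against a labeled prompt-level property such as "whose turn it is." All tensors, dimensions, and labels below are synthetic placeholders, and the class and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """One-layer SAE: activations -> overcomplete sparse features -> reconstruction."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # sparse feature activations
        return self.decoder(f), f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 sparsity penalty on feature activations.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()

# --- hypothesis-driven probe (synthetic stand-in data) ---
d_model, d_hidden, n = 512, 4096, 2048
acts = torch.randn(n, d_model)          # placeholder for residual-stream activations
turn_label = torch.randint(0, 2, (n,))  # 1 if it is X's turn in the prompt, else 0

sae = SparseAutoencoder(d_model, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
for _ in range(200):  # tiny training loop, for illustration only
    x_hat, f = sae(acts)
    loss = sae_loss(acts, x_hat, f)
    opt.zero_grad(); loss.backward(); opt.step()

# Score each SAE feature by its correlation with the "whose turn" label;
# high-|correlation| features are candidates for the hypothesized concept.
with torch.no_grad():
    _, f = sae(acts)
    f_c = f - f.mean(0, keepdim=True)
    y_c = (turn_label.float() - turn_label.float().mean()).unsqueeze(1)
    corr = (f_c * y_c).mean(0) / (f_c.std(0) * y_c.std() + 1e-8)
    top = corr.abs().topk(5)
    print("candidate 'turn' features:", top.indices.tolist())
```

The same scoring loop could, under the same assumptions, be repeated for higher-level hypotheses (board symmetries, forks, guaranteed wins) to compare how strongly prompt-level versus strategic concepts are represented before and after GRPO training.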
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 8025