Keywords: Chain of Thought Reasoning, Reinforcement Learning, Scalable Oversight, Language Modeling, Proximal Policy Optimization
TL;DR: We introduce "Markovian" RL-training that ensures that LM Chain-of-Thought is causally load-bearing, while improving performance on many-step arithmetic and GSM8K over baselines, thereby advancing more transparent and interpretable AI reasoning.
Abstract: Chain-of-Thought (CoT) reasoning holds great promise for explaining language model outputs, but recent studies have highlighted significant challenges in its practical application for interpretability. We propose to address this issue by making CoT causally essential to prediction through two key components: factoring next-token prediction through intermediate CoT text, and training CoT to predict future tokens independently of other context. This results in "Markovian" language models, where CoT serves as a fixed-size state for future token prediction. Our approach optimizes for "informativeness" – the improvement in next-token predictions using a trained CoT compared to a baseline. Using Proximal Policy Optimization (PPO) for arithmetic problems and policy gradient for GSM8K, we demonstrate effectiveness on both arithmetic problems with Mistral 7B and the GSM8K benchmark with Llama 3.1 8B, where the model learns to produce CoTs that are 33.20% more effective at predicting answers than the pre-trained baseline. The increased sensitivity of model performance to CoT perturbations provides strong evidence of CoT reliance. Furthermore, we show that CoTs trained for one model generalize to help other models predict answers, suggesting these CoTs capture reasoning patterns that transfer across different interpreters. This work advances the development of more interpretable language models, potentially enabling their extension to arbitrarily long contexts and enhancing AI reasoning capabilities across various domains.
Supplementary Material:  zip
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11801
Loading