Keywords: Markovian Transformers, chain-of-thought reasoning, language model interpretability, causal reasoning, reinforcement learning, next-token prediction, GSM8K, large language models
TL;DR: Markovian Transformers make Llama-3.1-8B emit causally necessary token-level chains of thought, boosting GSM8K accuracy by 34 percentage points and producing self-explanatory reasoning that the model relies on when predicting subsequent tokens.
Abstract: Chain-of-Thought (CoT) reasoning often fails to faithfully reflect a language model's underlying decision process. We address this by introducing a Markovian language model framework that can be understood as a reasoning autoencoder: it creates a text-based bottleneck where the CoT serves as an intermediate representation, forcing the model to compress essential reasoning into interpretable text before making predictions. We train this system with a GRPO-style policy gradient algorithm using parallel sampling, a frozen baseline CoT', within-batch standardized advantages, and actor-reward (chain-rule) gradients. Our approach yields large gains on QA tasks (e.g., GSM8K: 20.7% to 54.5%, +33.8 pp; ARC-Challenge: 47.5% to 76.9%, +29.4 pp). Perturbation analyses across edit types and severities show that the Markovian model is consistently more sensitive to CoT edits than the baseline (typically favored in 52–82% of cases), indicating stronger causal reliance on the CoT. Cross-model evaluation confirms that learned CoTs generalize across architectures, suggesting they capture transferable reasoning patterns rather than model-specific artifacts.
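To make the training recipe named in the abstract (parallel sampling, a frozen baseline CoT', within-batch standardized advantages, actor-reward gradients) concrete, here is a minimal sketch in PyTorch. It is not the authors' implementation: the function name, the input layout (precomputed answer log-probabilities given each sampled CoT and given the frozen CoT', plus each CoT's log-probability under the current policy), and the exact way the two loss terms are combined are all assumptions made for illustration.

```python
import torch

def markovian_grpo_loss(
    answer_logp_given_cot: torch.Tensor,       # (G,) answer log-prob given each sampled CoT
    answer_logp_given_baseline: torch.Tensor,  # scalar: answer log-prob given the frozen CoT'
    cot_logp: torch.Tensor,                    # (G,) log-prob of each sampled CoT under the policy
    eps: float = 1e-6,
) -> torch.Tensor:
    # Reward: how much each sampled CoT improves answer prediction over the frozen baseline CoT'.
    rewards = answer_logp_given_cot - answer_logp_given_baseline

    # Within-batch (per-question group) standardized advantages, GRPO-style.
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)

    # REINFORCE-style term: advantage-weighted CoT log-likelihood
    # (advantages are detached so this term reinforces only the CoT-sampling policy).
    pg_term = (adv.detach() * cot_logp).mean()

    # Actor-reward (chain-rule) term: backpropagate directly through the
    # differentiable reward, i.e. the answer log-probability given the CoT.
    chain_rule_term = answer_logp_given_cot.mean()

    return -(pg_term + chain_rule_term)


# Toy usage with dummy values (G = 4 parallel samples for one question):
loss = markovian_grpo_loss(
    answer_logp_given_cot=torch.randn(4, requires_grad=True),
    answer_logp_given_baseline=torch.tensor(-2.0),
    cot_logp=torch.randn(4, requires_grad=True),
)
loss.backward()
```

In this sketch the group-wise standardization plays the role of a learned value baseline, while the chain-rule term exploits the fact that the reward (answer log-probability) is itself differentiable with respect to the model.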
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 22558