Tracing LLM Reasoning Processes with Strategic Games: A Framework for Resource-Constrained Decision Making and Revision

20 Sept 2025 (modified: 11 Feb 2026), Submitted to ICLR 2026, CC BY 4.0
Keywords: Large Language Models, Reasoning Evaluation
TL;DR: We propose a game-based framework to evaluate LLMs on how they revise and decide under constraints, revealing process-level strengths and weaknesses beyond final outcomes.
Abstract: Large language models (LLMs) are increasingly applied to tasks that require complex reasoning. Most benchmarks evaluate only final reasoning outcomes and overlook the internal processes that lead to them, such as how a model revises and makes decisions under constraints. We argue that evaluating these internal reasoning steps is essential for understanding model behavior and improving reliability in real-world applications. To make these processes observable and measurable, we use strategic games as a natural and effective environment: they operate within closed, rule-based systems and provide interpretable states, limited resources, and automatic feedback. Building on this setting, we propose a framework that evaluates LLMs along two core process dimensions: resource-constrained decision making and revision. To support this, we introduce a set of evaluation metrics that extend beyond the traditional Win Rate (WR), including Over-Correction Risk Rate (ORR), Correction Success Rate (CSR), Improvement Slope ($\beta$), and Over-Budget Rate (OBR). Across 4320 adversarial rounds with 12 state-of-the-art models, we find that ChatGPT-o3-mini, which demonstrates strong reasoning capabilities, achieves the strongest results across process metrics (74.7\% WR, 78.6\% CSR, and $\beta=+0.041$). In contrast, Qwen-Plus, despite a high ORR of 81.6\%, achieves only a 25.6\% win rate, primarily due to excessive resource use. We observe a consistent negative trend between ORR and CSR, suggesting that more frequent corrections do not always improve outcomes. This pattern may reflect frequent, premature revisions, which often reduce overall effectiveness, whereas more targeted revisions are associated with higher accuracy. We hope this work offers a new direction for LLM evaluation, one that focuses not just on what models decide, but on how they decide it.
Primary Area: datasets and benchmarks
Submission Number: 22744
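
The abstract names several process metrics without giving their formulas. As a rough illustration only, the sketch below shows how three of them might be computed from per-round logs. Everything here is an assumption rather than the authors' definitions: the `Round` record and its fields are hypothetical, WR is taken as the fraction of rounds won, OBR as the fraction of rounds whose spending exceeds the budget, and $\beta$ as the least-squares slope of a per-round score over the round index. ORR and CSR are omitted because the abstract gives too little detail to guess at them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Round:
    """Hypothetical per-round log entry (not from the paper)."""
    won: bool      # did the model win this round?
    spent: float   # resources the model used
    budget: float  # resources it was allowed to use


def win_rate(rounds: List[Round]) -> float:
    # Assumed definition: fraction of rounds won.
    return sum(r.won for r in rounds) / len(rounds)


def over_budget_rate(rounds: List[Round]) -> float:
    # Assumed definition: fraction of rounds where spending exceeded the budget.
    return sum(r.spent > r.budget for r in rounds) / len(rounds)


def improvement_slope(scores: List[float]) -> float:
    # Assumed definition: ordinary least-squares slope of score vs. round index.
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to fit a slope")
    x_mean = (n - 1) / 2
    y_mean = sum(scores) / n
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(scores))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


# Made-up illustrative data, not results from the paper.
rounds = [
    Round(won=True, spent=8, budget=10),
    Round(won=False, spent=12, budget=10),
    Round(won=True, spent=10, budget=10),
    Round(won=True, spent=9, budget=10),
]
print(win_rate(rounds), over_budget_rate(rounds))
print(improvement_slope([0.0, 0.5, 1.0]))
```

Under these assumed definitions, a positive slope would indicate scores that improve over successive rounds, which is one plausible reading of the reported $\beta=+0.041$.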