Keywords: Reasoning Efficiency, Test-time Scaling, Large Language Models, Chain-of-Thought
TL;DR: We formalize reasoning efficiency to evaluate thinking models, discover potential scaling laws showing systematic overthinking on simple problems, and propose CoThink to adaptively scale computation with problem complexity.
Abstract: Recent thinking models trained with reinforcement learning and backward-checking CoT often suffer from overthinking: they produce excessively long outputs even on simple problems, wasting computation. Existing evaluations, based on token efficiency, give an incomplete view as they neglect problem difficulty and intermediate computation costs. We formalize reasoning efficiency as a relative measure between thinking and instruct models, treating instruct models as the minimal-effort baseline. A systematic study across four thinking models and multiple benchmarks reveals two consistent patterns: (i) instruct models achieve higher efficiency overall, and (ii) problem difficulty affects efficiency, with thinking models wasting computation on easy problems but providing value on harder ones. Building on this insight, we propose CoThink, a simple two-stage pipeline: an instruct model drafts a brief outline, and a thinking model expands it. On GSM8K, MATH500, and AIME24, CoThink cuts token usage by 21.1% across four thinking models while preserving accuracy, and remains competitive with strong efficiency baselines.
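The abstract describes CoThink only at a high level, so the following is a minimal sketch of the two-stage pipeline as stated there: an instruct model drafts a brief outline, then a thinking model expands it. The model names, prompts, token budgets, and the `generate` helper are illustrative assumptions, not the authors' implementation.

```python
def generate(model: str, prompt: str, max_tokens: int) -> str:
    """Placeholder for a call to an LLM inference backend (assumed)."""
    raise NotImplementedError


def cothink(problem: str) -> str:
    # Stage 1: an instruct model drafts a brief outline,
    # serving as the minimal-effort baseline.
    outline = generate(
        model="instruct-model",  # hypothetical model name
        prompt=f"Outline, in a few short steps, how to solve:\n{problem}",
        max_tokens=256,  # assumed small budget to keep the draft brief
    )
    # Stage 2: a thinking model expands the outline into a full solution,
    # so heavier computation is spent only on the expansion step.
    return generate(
        model="thinking-model",  # hypothetical model name
        prompt=(
            f"Problem:\n{problem}\n\n"
            f"Outline:\n{outline}\n\n"
            "Expand this outline into a complete solution."
        ),
        max_tokens=2048,  # assumed larger budget for the expansion
    )
```

Under this reading, the outline constrains the thinking model's search, which is one plausible mechanism for the reported token savings on easy problems.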
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11280