Keywords: reasoning, efficiency
TL;DR: We measure the tradeoff between reasoning length and accuracy and establish upper bounds on the tradeoff by computing token complexities for each question.
Abstract: Chain-of-thought prompting has emerged as a powerful technique for enabling large language models (LLMs) to solve complex reasoning tasks. However, these reasoning chains can be verbose, raising concerns about efficiency. In this work, we conduct the first systematic study of the relationship between reasoning length and model performance across a diverse range of compression instructions. We discover a universal tradeoff between reasoning length and accuracy that persists even across very distinct reasoning chains. We demonstrate that this tradeoff emerges from a sharp threshold behavior at the question level: each question has an intrinsic "token complexity", the minimal number of tokens required for successful problem-solving. We use token complexity to compute upper bounds on the optimal accuracy-compression tradeoff. Our analysis reveals that prompt-based compression strategies operate far from these theoretical limits, suggesting significant room for improvement and providing benchmarks to help researchers evaluate progress in reasoning efficiency.
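As a rough illustration of the threshold idea, the sketch below estimates a per-question token complexity (the shortest reasoning length at which any run answered correctly) and an upper-bound accuracy for a given token budget. This is a minimal sketch under assumed data structures, not the paper's actual procedure: the names `responses`, `token_complexity`, and `upper_bound_accuracy` are hypothetical, and `responses` is assumed to map each question to (token count, correct?) pairs gathered under different compression prompts.

```python
# Hypothetical sketch of the token-complexity idea, not the paper's implementation.
# Assumed input: `responses` maps each question id to a list of
# (num_reasoning_tokens, is_correct) pairs collected under different compression prompts.

def token_complexity(runs):
    """Smallest reasoning length (in tokens) at which any run answered correctly.

    Returns None if the question was never solved in any run.
    """
    correct_lengths = [n for n, ok in runs if ok]
    return min(correct_lengths) if correct_lengths else None


def upper_bound_accuracy(responses, token_budget):
    """Upper bound on accuracy at a token budget: a question counts as solvable
    iff its token complexity is defined and fits within the budget."""
    complexities = [token_complexity(runs) for runs in responses.values()]
    solvable = [c for c in complexities if c is not None and c <= token_budget]
    return len(solvable) / len(complexities)


# Toy usage: two questions, each run is (tokens used, correct?)
responses = {
    "q1": [(120, True), (40, False), (60, True)],   # token complexity = 60
    "q2": [(200, True), (90, False)],               # token complexity = 200
}
print(upper_bound_accuracy(responses, token_budget=100))  # 0.5
```

Sweeping `token_budget` over a range of values traces out an upper-bound accuracy-compression curve against which prompt-based compression strategies can be compared.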
Submission Number: 98