Keywords: reasoning, efficiency
TL;DR: We measure the tradeoff between reasoning length and accuracy and establish upper bounds on the tradeoff by computing token complexities for each question.
Abstract: Chain-of-thought prompting has emerged as a powerful technique for enabling large language models (LLMs) to solve complex reasoning tasks. However, these reasoning chains can be verbose, raising concerns about efficiency. In this work, we conduct the first systematic study of the relationship between reasoning length and model performance across a diverse range of compression instructions. We discover a universal tradeoff between reasoning length and accuracy that persists even across very distinct reasoning chains. We demonstrate that this tradeoff emerges from a sharp threshold behavior at the question level: each question has an intrinsic "token complexity", the minimal number of tokens required for successful problem-solving. We use token complexity to compute upper bounds on the optimal accuracy-compression tradeoff. Our analysis reveals that prompt-based compression strategies operate far from these theoretical limits, suggesting significant room for improvement and providing benchmarks to help researchers evaluate progress in reasoning efficiency.
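As a rough illustration of the threshold idea, the sketch below estimates a per-question token complexity (the shortest reasoning length at which any run answered correctly) and an upper-bound accuracy for a given token budget. This is a minimal sketch under assumed data structures, not the paper's actual procedure: the names `responses`, `token_complexity`, and `upper_bound_accuracy` are hypothetical, and `responses` is assumed to map each question to (token count, correct?) pairs gathered under different compression prompts.

```python
# Hypothetical sketch of the token-complexity idea, not the paper's implementation.
# Assumed input: `responses` maps each question id to a list of
# (num_reasoning_tokens, is_correct) pairs collected under different compression prompts.

def token_complexity(runs):
    """Smallest reasoning length (in tokens) at which any run answered correctly.

    Returns None if the question was never solved in any run.
    """
    correct_lengths = [n for n, ok in runs if ok]
    return min(correct_lengths) if correct_lengths else None


def upper_bound_accuracy(responses, token_budget):
    """Upper bound on accuracy at a token budget: a question counts as solvable
    iff its token complexity is defined and fits within the budget."""
    complexities = [token_complexity(runs) for runs in responses.values()]
    solvable = [c for c in complexities if c is not None and c <= token_budget]
    return len(solvable) / len(complexities)


# Toy usage: two questions, each run is (tokens used, correct?)
responses = {
    "q1": [(120, True), (40, False), (60, True)],   # token complexity = 60
    "q2": [(200, True), (90, False)],               # token complexity = 200
}
print(upper_bound_accuracy(responses, token_budget=100))  # 0.5
```

Sweeping `token_budget` over a range of values traces out an upper-bound accuracy-compression curve against which prompt-based compression strategies can be compared.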
Submission Number: 98