Keywords: Chain of Thought Reasoning, Large Language Model, Overthinking
Abstract: Chain-of-thought (CoT) reasoning enhances the multi-step reasoning capabilities of large language models (LLMs) by breaking complex tasks into smaller, manageable sub-tasks. Researchers have explored ways to improve the reasoning ability of LLMs by guiding models to generate more elaborate CoT processes, such as long CoT and test-time scaling. However, for most models and tasks, does an increase in CoT length consistently lead to improved reasoning accuracy?
In this paper, we observe a nuanced relationship: as the number of reasoning steps increases, performance initially improves but eventually decreases. To understand this phenomenon, we provide evidence that *longer reasoning processes are increasingly susceptible to noise.* We theoretically prove the existence of an optimal number of reasoning steps and derive a scaling law for this optimal CoT length based on model capability and task difficulty. Inspired by our theory, we propose length-aware majority voting to alleviate the effects of excessively long or short CoTs, and verify it on both synthetic and real-world datasets. Our findings highlight the critical need to calibrate CoT length to align with model capabilities and task demands, offering a principled framework for optimizing multi-step reasoning in LLMs.
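Since the abstract only summarizes the method, the sketch below shows one possible form of length-aware majority voting, not necessarily the paper's exact formulation: each sampled answer's vote is weighted by how far its chain length deviates from an assumed optimal length. The function name, `target_len`, and `temperature` are hypothetical parameters introduced here purely for illustration.

```python
import math
from collections import defaultdict

def length_aware_majority_vote(samples, target_len, temperature=1.0):
    """Aggregate sampled answers, down-weighting chains whose length
    deviates from an assumed optimal number of reasoning steps.

    samples: list of (answer, num_reasoning_steps) tuples.
    target_len: assumed optimal CoT length for this model/task (hypothetical).
    temperature: controls how sharply length deviations are penalized.
    """
    scores = defaultdict(float)
    for answer, n_steps in samples:
        # Hypothetical weighting: Gaussian penalty on deviation from target_len.
        weight = math.exp(-((n_steps - target_len) ** 2) / (2 * temperature ** 2))
        scores[answer] += weight
    # Return the answer with the highest accumulated weight.
    return max(scores, key=scores.get)

# Usage: the two mid-length chains agree on "42"; the very long chain is discounted.
samples = [("42", 5), ("42", 6), ("17", 20)]
print(length_aware_majority_vote(samples, target_len=6, temperature=3.0))  # -> "42"
```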
Submission Number: 126