From Long to Lean: Performance-aware and Adaptive Chain-of-Thought Compression via Multi-round Refinement

ACL ARR 2025 May Submission 2469 Authors

19 May 2025 (modified: 03 Jul 2025) · CC BY 4.0
Abstract: Chain-of-Thought (CoT) reasoning improves performance on complex tasks but introduces significant inference latency due to its verbosity. In this work, we propose Multi-round Adaptive Chain-of-Thought Compression (\textbf{MACC}), a framework that leverages the \textit{token elasticity phenomenon}—where overly small token budgets may paradoxically increase output length—to progressively compress CoTs via multi-round refinement. This adaptive strategy allows MACC to dynamically determine the optimal compression depth for each input. Our method achieves an average accuracy improvement of 5.6\% over state-of-the-art baselines while reducing CoT length by an average of 47 tokens and significantly lowering latency. Furthermore, we show that \textbf{test-time performance}—both accuracy and token length—can be reliably predicted from interpretable features such as perplexity and compression rate \textbf{measured on the training set}. Evaluated across different models, our method enables efficient model selection and performance forecasting without repeated fine-tuning, demonstrating that CoT compression is both effective and predictable. Our code will be released to facilitate reproducibility and future research in CoT compression.
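To make the multi-round refinement idea concrete for readers skimming the abstract, the sketch below illustrates one way such a loop could look. It is a hypothetical illustration, not the authors' released implementation: `compress_once`, the shrink factor, the whitespace tokenizer, and the stopping rule are all assumptions introduced here for demonstration.

```python
# Hypothetical sketch of multi-round adaptive CoT compression in the spirit
# of MACC. `compress_once`, the shrink factor, and the stopping rule are
# illustrative assumptions, not the paper's actual method.

def count_tokens(text: str) -> int:
    # Stand-in tokenizer; replace with the target model's tokenizer.
    return len(text.split())

def compress_once(cot: str, budget: int) -> str:
    # Stand-in for one refinement round. In practice this would be an LLM
    # prompted or fine-tuned to rewrite `cot` within `budget` tokens; here
    # we simply truncate so the sketch runs end to end.
    return " ".join(cot.split()[:budget])

def macc_compress(cot: str, shrink: float = 0.8, max_rounds: int = 5) -> str:
    """Progressively tighten the token budget across rounds, stopping as
    soon as the output stops getting shorter, i.e. the point where an
    overly small budget would start to inflate length (token elasticity)."""
    best = cot
    for _ in range(max_rounds):
        budget = int(count_tokens(best) * shrink)
        candidate = compress_once(best, budget)
        if count_tokens(candidate) >= count_tokens(best):
            break  # elasticity signal: further compression backfires
        best = candidate
    return best

if __name__ == "__main__":
    long_cot = "step " * 200  # dummy chain of thought
    short_cot = macc_compress(long_cot)
    print(count_tokens(long_cot), "->", count_tokens(short_cot))
```

The key design point the abstract highlights is the stopping rule: rather than fixing a compression depth in advance, the loop halts when further tightening no longer shortens the output, which is how the token-elasticity signal yields a per-input compression depth.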
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: LLM Efficiency, parameter-efficient-training
Contribution Types: Model analysis & interpretability, Reproduction study, Approaches low compute settings-efficiency, Data analysis
Languages Studied: English
Submission Number: 2469