MuTIS: Enhancing Reasoning Efficiency through Multi-Turn Intervention Sampling in Reinforcement Learning

ACL ARR 2025 May Submission1376 Authors

17 May 2025 (modified: 03 Jul 2025) · CC BY 4.0
Abstract: Long chain-of-thought (CoT) reasoning has recently attracted significant attention, with models such as DeepSeek-R1 achieving remarkable performance across various reasoning benchmarks. However, a common challenge for these models is the "overthinking" problem, which leads to excessive intermediate steps and diminished inference efficiency. While numerous efforts have targeted reductions in generated tokens, they frequently encounter an inherent trade-off: gains in efficiency often come at the cost of degraded performance. To overcome this challenge, we introduce the Multi-Turn Intervention Sampling Framework (MuTIS). Our framework leverages multi-turn interventions within rollouts to produce high-quality, concise reasoning chains. It fine-tunes reasoning models through reinforcement learning, demonstrably surpassing the previously described accuracy-efficiency trade-off. Through extensive experiments on challenging mathematical reasoning benchmarks, our approach achieves a substantial 11.3% improvement in accuracy while concurrently reducing token utilization by an average of 60.1%. Code, data, and models will be fully open-sourced.
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: reinforcement learning, reasoning efficiency, multi-turn rollout
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches low compute settings-efficiency, Data analysis
Languages Studied: English
Submission Number: 1376