Keywords: Large Language Model, Reasoning, Tool Use, Reinforcement Learning
TL;DR: We address the "Lazy Reasoning" phenomenon in LRMs with D-CORE, which combines self-distillation and diversity-aware RL. D-CORE-8B/14B achieve 77.7%/79.3% on BFCLv3.
Abstract: Effective tool use and reasoning are essential capabilities for large reasoning models (LRMs) to address complex real-world problems. Through empirical analysis, we identify a prevalent "Lazy Reasoning" phenomenon, in which LRMs frequently engage in repetitive and meaningless reflective reasoning. This occurs primarily because of their inadequate ability to decompose tasks when reasoning in complex tool-use scenarios. To address this, we propose a two-stage training framework, D-CORE ($\underline{\textbf{D}}$ecomposing tasks and $\underline{\textbf{Co}}$mposing $\underline{\textbf{Re}}$asoning processes), which first incentivizes the LRM's task decomposition reasoning capability via self-distillation, followed by diversity-aware reinforcement learning (RL) to restore the LRM's reflective reasoning capability. D-CORE achieves robust tool-use improvements across diverse benchmarks and model scales. Experiments on BFCLv3 demonstrate the superiority of our method: D-CORE-8B reaches 77.7\% accuracy, surpassing the best-performing 8B model by 5.7\%. Meanwhile, D-CORE-14B establishes a new state-of-the-art at 79.3\%, outperforming 70B models despite being 5× smaller.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 3614