Explore-Execute Chain: Towards an Efficient Structured Reasoning Paradigm

10 Sept 2025 (modified: 01 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Chain of Thought, Test Time Scaling, LLM Reasoning
TL;DR: We propose E$^2$C, a two-phase reasoning paradigm that decouples exploration from execution, delivering higher efficiency, stronger generalization, and improved interpretability.
Abstract: Chain-of-Thought (CoT) and its variants have markedly advanced the reasoning abilities of Large Language Models (LLMs), yet their monolithic, auto-regressive architecture inherently conflates high-level strategic planning with low-level step-by-step execution, leading to computational inefficiency, limited exploration of reasoning paths, and reduced interpretability. To overcome these issues, we propose the Explore-Execute Chain (E$^2$C), a structured reasoning framework that decouples reasoning into two distinct phases: an exploratory phase that stochastically generates succinct high-level plans, followed by an execution phase that deterministically carries out the chosen plan. Our approach incorporates a two-stage training methodology that combines Supervised Fine-Tuning (SFT), augmented by a novel data generation algorithm enforcing strict plan adherence, with a subsequent Reinforcement Learning (RL) stage that capitalizes on the informativeness of exploration and reinforces the determinism of execution. This decomposition enables an efficient test-time scaling strategy: on AIME 2024, E$^2$C test-time scaling reaches 58.1% accuracy using less than 10% of the decoding tokens required by comparable methods (e.g., Forest-of-Thought), sharply cutting self-consistency overhead. For cross-domain adaptation, our Exploration-Focused SFT (EF-SFT) fine-tunes with only 3.5% of the tokens used by standard SFT yet yields up to 14.5% higher accuracy than standard SFT on medical benchmarks. The result is state-of-the-art performance, strong generalization, and greater interpretability through the separation of planning from execution.
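To make the decoupled paradigm concrete, below is a minimal sketch of the explore-then-execute decoding loop with self-consistency voting over sampled plans. This is our illustration under stated assumptions, not the authors' released implementation: the helper names (`llm_generate`, `sample_plan`, `execute_plan`), the prompt templates, and the temperature and sample-count settings are all hypothetical.

```python
from collections import Counter

def llm_generate(prompt: str, temperature: float) -> str:
    """Placeholder for a single LLM decoding call; wire this up to an
    actual model endpoint before running (hypothetical interface)."""
    raise NotImplementedError

def sample_plan(question: str) -> str:
    # Exploration phase: stochastic (high-temperature) decoding of a
    # succinct high-level plan.
    return llm_generate(f"Question: {question}\nPlan:", temperature=1.0)

def execute_plan(question: str, plan: str) -> str:
    # Execution phase: deterministic (greedy) decoding that carries the
    # chosen plan out step by step and returns a final answer.
    return llm_generate(
        f"Question: {question}\nPlan: {plan}\nExecution:", temperature=0.0
    )

def e2c_test_time_scaling(question: str, n_plans: int = 8) -> str:
    # Test-time scaling: sample several short plans, execute each one
    # deterministically, and majority-vote over the resulting answers.
    answers = [execute_plan(question, sample_plan(question))
               for _ in range(n_plans)]
    return Counter(answers).most_common(1)[0][0]
```

The key design point this sketch captures is that stochasticity is confined to the succinct plans while each execution is deterministic, so repeated sampling multiplies short exploration candidates rather than full reasoning traces, which is where the abstract's claimed reduction in self-consistency overhead would come from.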
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 3676