Systematic Exploration Supervision Enables Scaling Beyond Training Complexity

ICLR 2026 Conference Submission 21365 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: Planning, Algorithm distillation, reasoning, supervised learning

Abstract: Language models trained on input-output pairs or linear Chain-of-Thought (CoT) traces often fail when task complexity at test time exceeds the regime seen during training; reinforcement learning methods can help but suffer from cold-start brittleness when base accuracy is low. We introduce Systematic Exploration Supervision (SES), a process-level supervision framework that verbalizes complete multi-branch search traces (sampling alternatives, propagating outcomes, and backtracking to extract a solution) rather than a single reasoning chain. In textualized Gridworld, SES preserves 76.5% success when scaling from 10×10 training environments to unseen 20×20 grids (vs. 19.0% for standard supervised fine-tuning, 26.0% for inference-time Tree-of-Thought, and 6.0% for GRPO). We further extend SES to open domains via a bootstrapped trace construction procedure that guarantees inclusion of at least one valid solution while adding diverse, reward-prioritized alternatives. Results show substantial improvements on combinatorial reasoning (Game of 24: 47% vs. 17% for the best baseline) and competitive performance on logical reasoning (ProntoQA: 100%), with task-dependent effectiveness patterns. We demonstrate that SES behavior cannot be induced with few-shot prompting alone, even with sophisticated models like GPT-o1, suggesting in-weight algorithmic policy acquisition. Remarkably, our approach achieves 14× parameter efficiency, with a 0.5B model outperforming 7B baselines. We characterize when SES is advantageous (large branching factors, low base competence, scaling demands) and discuss limitations (token length inflation, diminished effectiveness when base models are already competent). Our findings highlight full-search verbalization as a simple, offline alternative to inference-time search or costly RL for scaling systematic reasoning.
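
To make "verbalizing a complete multi-branch search trace" concrete, the following is a minimal illustrative sketch in Python. It is not the authors' trace format or search procedure, neither of which is specified in the abstract. It assumes a plain depth-first search over a small grid with walls marked `#`, and every name in it (`verbalize_search`, `MOVES`, `GRID`, the wording of the trace lines) is hypothetical. The point is only that the emitted text records the branches tried, the dead ends, and the backtracking steps, rather than just the final path.

```python
from typing import List, Optional, Tuple

Cell = Tuple[int, int]

# Hypothetical move set and ordering; an assumption made for this sketch only.
MOVES = [("right", (0, 1)), ("down", (1, 0)), ("left", (0, -1)), ("up", (-1, 0))]


def verbalize_search(
    grid: List[str], start: Cell, goal: Cell
) -> Tuple[List[str], Optional[List[Cell]]]:
    """Run depth-first search and log each branch, dead end, and backtrack as text."""
    trace: List[str] = []
    visited = {start}

    def dfs(cell: Cell, path: List[Cell]) -> Optional[List[Cell]]:
        if cell == goal:
            trace.append(f"reached goal at {cell}")
            return path
        for name, (dr, dc) in MOVES:
            nxt = (cell[0] + dr, cell[1] + dc)
            r, c = nxt
            # Skip moves that leave the grid, hit a wall, or revisit a cell.
            if not (0 <= r < len(grid) and 0 <= c < len(grid[0])):
                continue
            if grid[r][c] == "#" or nxt in visited:
                continue
            visited.add(nxt)
            trace.append(f"at {cell}: try {name} -> {nxt}")
            found = dfs(nxt, path + [nxt])
            if found is not None:
                return found
            # The branch failed: record the backtrack explicitly in the trace.
            trace.append(f"dead end at {nxt}: backtrack to {cell}")
        return None

    solution = dfs(start, [start])
    if solution is None:
        trace.append("no solution found")
    else:
        trace.append("solution: " + " -> ".join(str(c) for c in solution))
    return trace, solution


# Toy 3x3 grid (an assumption for illustration); '#' marks a wall.
GRID = [
    "..#",
    ".#.",
    "...",
]
trace, solution = verbalize_search(GRID, (0, 0), (2, 2))
print("\n".join(trace))
```

In this toy grid the search first tries moving right, hits a dead end, and backtracks before finding a route downward, so the printed trace contains a failed branch alongside the successful one. A trace of this kind, rather than the solution path alone, is the sort of text a process-level supervision target would contain under the assumptions above.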

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 21365
