Keywords: Chain-of-Thought reasoning, Direct Preference Optimization, Process supervision, Twisted Sequential Monte Carlo, Large language models
Abstract: Inference-time scaling enhances a model’s reasoning by extending its chain-of-thought (CoT). However, existing approaches typically rely on a single policy trained with outcome-reward reinforcement learning (RL), which often suffers from long-horizon plan failures, i.e., the implicit plan drifts away from any valid strategy. This problem is particularly severe for smaller language models (LMs) with long CoTs due to their limited capacity. To address this, we propose Multi-Level Reasoning (MLR), which reformulates long-CoT generation as a two-level stochastic process. Specifically, MLR employs two policies: a high-level planner that generates step descriptors (abstract subgoals) and a low-level executor that produces detailed content conditioned on these descriptors. The planner then generates the next subgoal based on a summary of the current step, forming an alternating plan–execute loop. To maintain scalability, we adopt a minimal design in which the base model serves as the low-level policy and a lightweight LoRA module implements the high-level policy. For training, we observe that outcome-reward RL is inefficient and weakly informative for long trajectories (e.g., those exceeding 4K tokens). To overcome this, we introduce online Step-DPO, a process-level preference optimization scheme that leverages Twisted Sequential Monte Carlo (TSMC) to provide scalable stepwise supervision. This yields more effective training, improved stability, and higher accuracy. Extensive experiments on challenging math, science, and logical reasoning benchmarks show that, with only 10% of the SFT data and 5% of the preference data, MLR outperforms both the DeepSeek-R1 distillation and the outcome-reward RL baselines across multiple base models and tasks. More importantly, MLR exhibits slower performance degradation on long-horizon reasoning, demonstrating stronger robustness under extended CoT generation.
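The abstract's alternating plan–execute loop can be illustrated with a minimal sketch. The names `planner_generate`, `executor_generate`, and `summarize` below are hypothetical stand-ins for calls to the LoRA-based high-level planner, the base-model executor, and a step summarizer; the paper's actual prompting, stopping criteria, and summarization details are not specified here.

```python
def multi_level_reasoning(problem, planner_generate, executor_generate, summarize,
                          max_steps=16):
    """Sketch of MLR's alternating plan-execute loop (assumed interface).

    planner_generate(context)              -> next abstract subgoal, or None when done
    executor_generate(problem, subgoal, trace) -> detailed reasoning for that subgoal
    summarize(subgoal, step_content)       -> compact summary fed back to the planner
    """
    context = problem   # running summary visible to the high-level planner
    trace = []          # full low-level chain-of-thought, one entry per step
    for _ in range(max_steps):
        # High-level policy: propose the next step descriptor (abstract subgoal).
        subgoal = planner_generate(context)
        if subgoal is None:   # planner signals that the plan is complete
            break
        # Low-level policy: expand the subgoal into detailed step content.
        step_content = executor_generate(problem, subgoal, trace)
        trace.append((subgoal, step_content))
        # Feed a summary of the finished step back to the planner for the next subgoal.
        context = context + "\n" + summarize(subgoal, step_content)
    return trace
```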
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 24318