Keywords: reasoning, synthetic tasks, heuristic solvers, MCTS, GRPO
Abstract: Complex reasoning---central to intelligent behavior---demands capabilities beyond the pretrained knowledge of large language models (LLMs). Prevailing efforts to improve LLM reasoning often bootstrap from predefined question--answer pairs, selecting high-quality traces to guide self-improvement; this does not scale because it requires curated problems and solutions. We address this limitation by introducing VAST: Reasoning on Auto-verifiable, Scalable, multi-step synthetic Tasks. VAST enables scalable improvement through structured tasks that grow in difficulty and allow algorithmic verification of both final answers and intermediate steps, substantially reducing the need for costly human annotation. We present two complementary methods that leverage VAST: (1) $\nu$GRPO, an online approach that constructs rewards from solver outputs and generalizes to out-of-distribution tasks; and (2) $\nu$MCTS, a Monte Carlo Tree Search method that derives intermediate rewards from rollout-based solution discovery, thereby enabling self-improvement without human annotation. Together, VAST and these methods provide a practical path to robust, scalable, and verifiable multi-step reasoning in modern LLMs.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 13585