Planning at Inference: MCTS Test-Time Scaling for Long Video Generation

Ritvik Bale; Ethan He; Ashwath Aithal; Linnan Wang

19 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: Video Generation, Long Video Generation, Test Time Scaling, Monte Carlo Tree Search

TL;DR: We leverage Monte Carlo Tree Search–based test-time scaling to select better continuations, enabling the generation of coherent long videos.

Abstract: Generating long videos with consistent content and visual quality remains a major challenge, as existing one-shot and chunked methods often suffer from semantic drift and compounding artifacts. We explore Test-Time Scaling (TTS) as a framework for long video generation, formulating the task as a sequential decision-making problem. Our approach uses Monte Carlo Tree Search (MCTS) to evaluate multiple continuations with look-ahead rollouts and backpropagated rewards, and we introduce a Multi-Tree MCTS variant that improves exploration in continuous generation spaces. The method is modular and can be applied to existing backbones without retraining. Experiments on Cosmos-Predict2 and other models show consistent improvements in object permanence, temporal coherence, and text-video alignment over Best-of-N, Greedy, and Beam search. Furthermore, our method produces high-quality videos exceeding 20 seconds, surpassing the output of leading models like Sora and Kling by 18% and 47% respectively, all while maintaining comparable visual fidelity. Although the results are limited by the quality of current generators and verifiers, our study highlights both the promise of search-based TTS and the limitations of today’s video generation and evaluation models.
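To make the search loop in the abstract concrete, the following is a minimal single-tree sketch of chunk-wise MCTS, not the authors' implementation and not the Multi-Tree variant. Everything named here is an assumption for illustration: `toy_generate` stands in for a video backbone (each "chunk" is just a number), `toy_reward` stands in for a verifier, and UCB1 selection, the branching factor, and the rollout length are choices the abstract does not specify.

```python
import math
import random


class Node:
    """One node per partial video; `chunks` is the sequence generated so far."""

    def __init__(self, chunks, parent=None):
        self.chunks = chunks
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value_sum = 0.0

    def ucb(self, c=1.4):
        # UCB1 score (assumed selection rule): mean reward plus exploration bonus.
        if self.visits == 0:
            return float("inf")
        mean = self.value_sum / self.visits
        return mean + c * math.sqrt(math.log(self.parent.visits) / self.visits)


def toy_generate(chunks, rng):
    """Hypothetical generator: the next chunk is a random step from the last one."""
    last = chunks[-1] if chunks else 0.0
    return last + rng.uniform(-1.0, 1.0)


def toy_reward(chunks):
    """Hypothetical verifier: penalizes drift away from 0, returns a value in (0, 1]."""
    drift = sum(abs(x) for x in chunks) / max(len(chunks), 1)
    return 1.0 / (1.0 + drift)


def mcts_next_chunk(prefix, rng, n_sim=200, branch=4, rollout_len=3):
    """Pick the next chunk for `prefix` using look-ahead rollouts."""
    root = Node(list(prefix))
    for _ in range(n_sim):
        # Selection: descend through fully expanded nodes by UCB1.
        node = root
        while len(node.children) >= branch:
            node = max(node.children, key=Node.ucb)
        # Expansion: sample one new continuation of the selected node.
        child = Node(node.chunks + [toy_generate(node.chunks, rng)], parent=node)
        node.children.append(child)
        # Rollout: extend the continuation a few chunks ahead and score it.
        sim = list(child.chunks)
        for _ in range(rollout_len):
            sim.append(toy_generate(sim, rng))
        reward = toy_reward(sim)
        # Backpropagation: update statistics from the new node up to the root.
        while child is not None:
            child.visits += 1
            child.value_sum += reward
            child = child.parent
    # Commit to the most-visited continuation of the root.
    best = max(root.children, key=lambda n: n.visits)
    return best.chunks[-1]


def generate_video(n_chunks, seed=0):
    """Build a long sequence chunk by chunk, running a fresh search at each step."""
    rng = random.Random(seed)
    video = []
    for _ in range(n_chunks):
        video.append(mcts_next_chunk(video, rng))
    return video
```

Calling `generate_video(5)` returns five toy chunks, each chosen by its own search. In a real setting the generator call would dominate the cost, so `n_sim`, `branch`, and `rollout_len` would be the knobs that trade compute for quality at test time.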

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 21926
