Planning at Inference: MCTS Test-Time Scaling for Long Video Generation

ICLR 2026 Conference Submission21926 Authors

19 Sept 2025 (modified: 08 Oct 2025) | ICLR 2026 Conference Submission | CC BY 4.0
Keywords: Video Generation, Long Video Generation, Test Time Scaling, Monte Carlo Tree Search
TL;DR: We leverage Monte Carlo Tree Search–based test-time scaling to select better continuations, enabling the generation of coherent long videos.
Abstract: Generating long videos with consistent content and visual quality remains a major challenge, as existing one-shot and chunked methods often suffer from semantic drift and compounding artifacts. We explore Test-Time Scaling (TTS) as a framework for long video generation, formulating the task as a sequential decision-making problem. Our approach uses Monte Carlo Tree Search (MCTS) to evaluate multiple continuations with look-ahead rollouts and backpropagated rewards, and we introduce a Multi-Tree MCTS variant that improves exploration in continuous generation spaces. The method is modular and can be applied to existing backbones without retraining. Experiments on Cosmos-Predict2 and other models show consistent improvements in object permanence, temporal coherence, and text-video alignment over Best-of-N, Greedy, and Beam search. Furthermore, our method produces high-quality videos exceeding 20 seconds, surpassing the output of leading models like Sora and Kling by 18% and 47% respectively, all while maintaining comparable visual fidelity. Although the results are limited by the quality of current generators and verifiers, our study highlights both the promise of search-based TTS and the limitations of today’s video generation and evaluation models.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 21926
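
Illustrative sketch (not from the submission): the abstract frames chunked long video generation as sequential decision-making, with MCTS scoring candidate continuations through look-ahead rollouts and backpropagated rewards. The toy Python below shows a generic single-tree MCTS with UCB1 selection in that spirit. Everything in it is an assumption made for illustration: a "chunk" is a single float, `propose_chunk` stands in for a video backbone, `verifier_reward` stands in for a verifier, and the parameters `iterations`, `branching`, and `lookahead` are hypothetical. It does not reproduce the paper's Multi-Tree variant or its reward design.

```python
import math
import random


class Node:
    """One search-tree node: the chunk sequence generated so far."""

    def __init__(self, chunks, parent=None):
        self.chunks = chunks
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0  # sum of backpropagated rewards


def propose_chunk(chunks, rng):
    # Hypothetical stand-in for a video backbone: the next "chunk"
    # is the previous one plus Gaussian noise.
    return chunks[-1] + rng.gauss(0.0, 1.0)


def verifier_reward(chunks):
    # Hypothetical stand-in for a verifier: reward in (0, 1] that
    # decreases as chunks drift away from the first chunk.
    drift = sum(abs(c - chunks[0]) for c in chunks) / len(chunks)
    return 1.0 / (1.0 + drift)


def ucb1(child, parent_visits, c=1.4):
    if child.visits == 0:
        return float("inf")
    exploit = child.value / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore


def mcts_next_chunk(prefix, rng, iterations=200, branching=4, lookahead=3):
    """Pick the next chunk for `prefix` with a small MCTS."""
    root = Node(list(prefix))
    base = len(prefix)
    for _ in range(iterations):
        node = root
        # Selection: descend through fully expanded nodes via UCB1.
        while len(node.children) == branching and len(node.chunks) - base < lookahead:
            node = max(node.children, key=lambda ch: ucb1(ch, node.visits))
        # Expansion: sample one new continuation unless at max depth.
        if len(node.chunks) - base < lookahead:
            child = Node(node.chunks + [propose_chunk(node.chunks, rng)], parent=node)
            node.children.append(child)
            node = child
        # Rollout: extend to the look-ahead horizon, then score.
        rollout = list(node.chunks)
        while len(rollout) - base < lookahead:
            rollout.append(propose_chunk(rollout, rng))
        reward = verifier_reward(rollout)
        # Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Commit to the most-visited first-level continuation.
    best = max(root.children, key=lambda ch: ch.visits)
    return best.chunks[-1]


def generate(first_chunk, num_chunks, seed=0):
    """Build a chunk sequence by running one search per new chunk."""
    rng = random.Random(seed)
    video = [first_chunk]
    while len(video) < num_chunks:
        video.append(mcts_next_chunk(video, rng))
    return video
```

Calling `generate(0.0, 5)` returns a list of five toy chunks beginning with `0.0`. With a real backbone and verifier, each `propose_chunk` call would be a full generation pass, so the search budget (`iterations`, `branching`, `lookahead`) would dominate inference cost.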