Keywords: Video Generation, Long Video Generation, Test Time Scaling, Monte Carlo Tree Search
TL;DR: We leverage Monte Carlo Tree Search–based test-time scaling to select better continuations, enabling the generation of coherent long videos.
Abstract: Generating long videos with consistent content and visual quality remains a major
challenge, as existing one-shot and chunked methods often suffer from semantic drift
and compounding artifacts. We explore Test-Time Scaling (TTS)
as a framework for long video generation, formulating the task as a sequential
decision-making problem. Our approach uses Monte Carlo Tree Search (MCTS)
to evaluate multiple continuations with look-ahead rollouts and backpropagated
rewards, and we introduce a Multi-Tree MCTS variant that improves exploration
in continuous generation spaces. The method is modular and can be applied to existing
backbones without retraining. Experiments on Cosmos-Predict2 and other
models show consistent improvements in object permanence, temporal coherence,
and text-video alignment over Best-of-N, Greedy, and Beam search. Furthermore,
our method produces high-quality videos exceeding 20 seconds, surpassing the
output of leading models like Sora and Kling by 18% and 47% respectively, all
while maintaining comparable visual fidelity. Although the results are limited
by the quality of current generators and verifiers, our study highlights both the
promise of search-based TTS and the limitations of today’s video generation and
evaluation models.
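
As a rough illustration of the search loop the abstract describes (multiple candidate continuations, look-ahead rollouts, backpropagated rewards), the following is a minimal single-tree UCT sketch. It is not the paper's Multi-Tree MCTS variant. The names `generate_chunk`, `verifier_reward`, `mcts_next_chunk`, and `generate_long_video` are hypothetical, and the generator and verifier are toy stand-ins (a chunk is just a random float, the reward is the mean of the chunk values) so that the example runs without any video model.

```python
import math
import random


class Node:
    """One node per partial video, stored as the list of chunks so far."""

    def __init__(self, chunks, parent=None):
        self.chunks = chunks
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value_sum = 0.0

    def ucb(self, c=1.4):
        # Unvisited children are tried first.
        if self.visits == 0:
            return float("inf")
        mean = self.value_sum / self.visits
        explore = math.sqrt(math.log(self.parent.visits) / self.visits)
        return mean + c * explore


def generate_chunk(chunks, rng):
    # Hypothetical stand-in for a video backbone: a "chunk" is a float in [0, 1).
    return rng.random()


def verifier_reward(chunks):
    # Hypothetical stand-in for a verifier: mean chunk value.
    return sum(chunks) / len(chunks) if chunks else 0.0


def mcts_next_chunk(prefix, rng, n_sim=50, branch=3, depth=2):
    """Pick the next chunk for `prefix` using plain UCT."""
    root = Node(list(prefix))
    for _ in range(n_sim):
        # Selection: descend by UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=lambda n: n.ucb())
        # Expansion: sample `branch` candidate continuations.
        if node is root or node.visits > 0:
            node.children = [
                Node(node.chunks + [generate_chunk(node.chunks, rng)], node)
                for _ in range(branch)
            ]
            node = node.children[0]
        # Look-ahead rollout: extend by `depth` chunks and score the result.
        rollout = list(node.chunks)
        for _ in range(depth):
            rollout.append(generate_chunk(rollout, rng))
        reward = verifier_reward(rollout)
        # Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.value_sum += reward
            node = node.parent
    # Commit to the most-visited first-level continuation.
    best = max(root.children, key=lambda n: n.visits)
    return best.chunks[-1]


def generate_long_video(n_chunks, rng):
    """Build a video chunk by chunk, running a fresh search at every step."""
    video = []
    for _ in range(n_chunks):
        video.append(mcts_next_chunk(video, rng))
    return video
```

In a real setting, `generate_chunk` would call a video backbone conditioned on the preceding chunks and `verifier_reward` would call a learned scorer; the search budget (`n_sim`, `branch`, `depth`) is the knob that trades test-time compute for selection quality.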
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 21926