Latent Reward-Guided Search for Faster Inference-Time Scaling in Video Diffusion

Zengqun Zhao; Ziquan Liu; Yu Cao; Shaogang Gong; Zhensong Zhang; Jifei Song; Jiankang Deng; Ioannis Patras

Latent Reward-Guided Search for Faster Inference-Time Scaling in Video Diffusion

Zengqun Zhao, Ziquan Liu, Yu Cao, Shaogang Gong, Zhensong Zhang, Jifei Song, Jiankang Deng, Ioannis Patras

15 Sept 2025 (modified: 11 Feb 2026)Submitted to ICLR 2026EveryoneRevisionsBibTeXCC BY 4.0

Keywords: Video Generation, Diffusion Models, Reward Model, Inference-Time Scaling

Abstract: The recent success of inference-time scaling in large language models has inspired similar explorations in video diffusion. In particular, motivated by the existence of “golden noise” that enhances video quality, prior work has attempted to improve inference by optimising or searching for better initial noise. However, these approaches have notable limitations: they either rely on priors imposed at the beginning of noise sampling or on rewards evaluated only on the denoised and decoded videos. This leads to error accumulation, delayed and sparse reward signals, and prohibitive computational cost, which prevents the use of stronger search algorithms. Crucially, stronger search algorithms are precisely what could unlock substantial gains in controllability, sample efficiency and generation quality for video diffusion, provided their computational cost can be reduced. To fill in this gap, we enable efficient inference-time scaling for video diffusion through latent reward guidance, which provides intermediate, informative and efficient feedback along the denoising trajectory. We introduce a latent reward model that scores partially denoised latents at arbitrary timesteps with respect to visual quality, motion quality, and text alignment. Building on this model, we propose LatSearch, a novel inference-time search mechanism that performs Reward-Guided Resampling and Pruning (RGRP). In the resampling stage, candidates are sampled according to reward-normalised probabilities to reduce over-reliance on the reward model. In the pruning stage, applied at the final scheduled step, only the candidate with the highest cumulative reward is retained, improving both quality and efficiency. We evaluate LatSearch on the VBench-2.0 benchmark and demonstrate that it consistently improves video generation across multiple evaluation dimensions compared to the baseline Wan2.1 model. Compared with the state-of-the-art, our approach achieves comparable or better quality while reducing runtime by up to 79%.

Supplementary Material: zip

Primary Area: generative models

Submission Number: 5754

Loading