BlockVid: Block Diffusion for High-Fidelity and Coherent Minute-Long Video Generation

ICLR 2026 Conference Submission 21119 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Long Video Generation, Block Diffusion
Abstract: Generating minute-long videos is a critical step toward developing world models, providing a foundation for realistic extended scenes and advanced AI simulators. The emerging semi-autoregressive (block diffusion) paradigm integrates the strengths of diffusion and autoregressive models, enabling arbitrary-length video generation and improving inference efficiency through KV caching and parallel sampling. However, it still suffers from error accumulation in the KV cache over long sequences, and the field lacks benchmarks suited to evaluating minute-long generation. To overcome these limitations, we propose BlockVid, a novel block diffusion framework equipped with a semantic-aware sparse KV cache, an effective training strategy called Block Forcing, and dedicated noise scheduling to reduce error propagation and enhance temporal consistency. Additionally, we introduce LV-Bench, a fine-grained benchmark for minute-long videos, complete with new metrics designed to evaluate long-range coherence. Extensive experiments on VBench and LV-Bench demonstrate that our approach consistently outperforms existing methods in generating high-quality, coherent minute-long videos. In particular, it achieves a 22.2% improvement on VDE Subject and a 19.4% improvement on VDE Clarity in LV-Bench over the current state of the art.
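For readers unfamiliar with the semi-autoregressive (block diffusion) decoding loop the abstract refers to, the sketch below is a minimal, purely illustrative Python mock-up: each block is denoised by a diffusion loop conditioned on a bounded cache of previously generated blocks. All names (`denoise_block`, `generate_long_video`, `cache_limit`) are hypothetical stand-ins, the denoiser is a dummy, and the truncated list only mimics the idea of a limited/sparse KV cache; this is not BlockVid's actual algorithm or implementation.

```python
import numpy as np

def denoise_block(noisy_block: np.ndarray, step: int, context_kv: list) -> np.ndarray:
    """Stand-in for a learned denoiser; a real model would attend to context_kv."""
    # Dummy update: shrink the noise a little each step.
    return noisy_block * (1.0 - 1.0 / (step + 1))

def generate_long_video(num_blocks: int = 4,
                        frames_per_block: int = 8,
                        latent_dim: int = 16,
                        diffusion_steps: int = 10,
                        cache_limit: int = 2,
                        seed: int = 0) -> np.ndarray:
    """Semi-autoregressive (block-wise) generation with a bounded KV-style cache."""
    rng = np.random.default_rng(seed)
    context_kv: list = []  # cached features of previously generated blocks
    blocks = []
    for _ in range(num_blocks):
        x = rng.standard_normal((frames_per_block, latent_dim))  # start from noise
        for step in range(diffusion_steps, 0, -1):                # denoise this block only
            x = denoise_block(x, step, context_kv)
        blocks.append(x)
        context_kv.append(x)                    # cache the finished block as context
        context_kv = context_kv[-cache_limit:]  # keep the cache bounded ("sparse" stand-in)
    # Full video latent: (num_blocks * frames_per_block, latent_dim)
    return np.concatenate(blocks, axis=0)

if __name__ == "__main__":
    video = generate_long_video()
    print(video.shape)  # (32, 16)
```

The point of the sketch is only the control flow: blocks are produced one after another (autoregressive across blocks), each block is refined by a diffusion loop (parallel within a block), and conditioning on earlier blocks happens through a cache whose size must be managed over long sequences.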
Supplementary Material: zip
Primary Area: generative models
Submission Number: 21119