Abstract: With the advance of diffusion models, today's video generation has achieved impressive quality. To extend the generation length and facilitate real-world applications, most video diffusion models (VDMs) generate videos in an autoregressive manner, i.e., generating subsequent clips conditioned on the last frame(s) of the previous clip. However, existing autoregressive VDMs are highly inefficient and redundant: the model must re-compute all the conditional frames that overlap between adjacent clips. This issue is exacerbated when the conditional frames are extended autoregressively to provide the model with long-term context. In such cases, the computational demands increase significantly (i.e., with quadratic complexity w.r.t. the autoregression step). In this paper, we propose **Ca2-VDM**, an efficient autoregressive VDM with **Ca**usal generation and **Ca**che sharing. For **causal generation**, it introduces unidirectional feature computation, which ensures that the cache of conditional frames can be precomputed in previous autoregression steps and reused in every subsequent step, eliminating redundant computations. For **cache sharing**, it shares the cache across all denoising steps to avoid the huge cache storage cost. Extensive experiments demonstrate that our Ca2-VDM achieves state-of-the-art quantitative and qualitative video generation results and significantly improves the generation speed. Code is available: https://github.com/Dawn-LX/CausalCache-VDM
Lay Summary: Nowadays, video synthesis technology has achieved impressive results, thanks to a technique called "video diffusion models" (VDMs). Each video frame is synthesized through multiple iterations following the diffusion mechanism. Current methods generate videos in short clips, using previously generated clips to create new ones. However, these approaches are slow and repetitive, because they waste too much time recalculating already-generated frames (the frames overlapped between adjacent clips) when using them as references.
Our paper introduces Ca2-VDM, a new method designed to make video generation faster and more efficient. It uses two key ideas: **causal generation** and **cache sharing**. Causal generation means each frame is computed based only on the frames that came before it, so frame information can be pre-calculated and stored for future use. Cache sharing allows the method to reuse this stored information (i.e., cache) throughout all diffusion iterations, helping to use much less computer memory.
With these innovations, our method speeds up video synthesis and reduces the amount of computer memory needed, all while maintaining the quality of the videos produced.
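To make the caching idea above concrete, here is a minimal NumPy sketch of cross-clip attention with a reusable key/value cache. This is a generic, single-head, unbatched illustration of the mechanism, not the paper's actual implementation; all names (`causal_attention_step`, `k_cache`, etc.) are hypothetical. The point it shows: because new frames only read from earlier frames, the keys/values of conditional frames computed at earlier autoregression steps can be reused verbatim instead of being recomputed.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention_step(q_new, k_cache, v_cache, k_new, v_new):
    """One autoregression step: queries of the NEW clip attend to the
    cached conditional frames plus the new clip itself.

    k_cache/v_cache were computed at previous autoregression steps and
    are reused here unchanged -- no recomputation of conditional frames.
    Returns the attention output and the updated cache for the next step.
    """
    k = np.concatenate([k_cache, k_new], axis=0)
    v = np.concatenate([v_cache, v_new], axis=0)
    scores = q_new @ k.T / np.sqrt(q_new.shape[-1])
    out = softmax(scores) @ v
    return out, k, v

# Illustrative usage: 4 cached conditional frames, 2 new frames, dim 8.
rng = np.random.default_rng(0)
d = 8
k_cache = rng.standard_normal((4, d))
v_cache = rng.standard_normal((4, d))
q_new = rng.standard_normal((2, d))
k_new = rng.standard_normal((2, d))
v_new = rng.standard_normal((2, d))

out, k_cache, v_cache = causal_attention_step(q_new, k_cache, v_cache, k_new, v_new)
# The cache has grown by the new clip and can be reused next step.
```

Note that "cache sharing" in the paper goes one step further: since the conditional frames are already clean (fully denoised), the same cache can be shared across all denoising iterations of a new clip, rather than storing one cache per denoising step.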
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/Dawn-LX/CausalCache-VDM
Primary Area: Applications->Computer Vision
Keywords: Video Generation, Video Diffusion Model
Submission Number: 10903