Keywords: Video VAE
Abstract: Video Variational Auto-Encoders (Video VAEs) compress video data from the highly redundant pixel space into a compact latent representation, playing an important role in state-of-the-art video generation models. However, existing methods typically learn inter-frame correlations implicitly, overlooking the potential of breaking video compression down into two separate parts: keyframe encoding and inter-frame dynamics encoding, a fundamental design of traditional video codecs. To address this, we incorporate the design principles of traditional video codec standards into the Video VAE and introduce VC-VAE, a model that explicitly separates keyframe compression from inter-frame dynamics compression. We start by establishing a high-fidelity static keyframe anchor through initialization from a powerful pre-trained image VAE. Then, to explicitly model dynamics relative to this anchor, we introduce the Temporal Dynamic Difference Convolution (TDC), an operator designed to learn sparse motion residuals from inter-frame differences while maintaining a separate pathway for static content. Qualitative and quantitative experiments show that our proposed VC-VAE significantly outperforms baseline models in reconstruction quality, dynamics modelling, and training efficiency.
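The abstract describes the TDC operator only at a high level: a motion pathway over inter-frame differences alongside a separate static-content pathway. A minimal NumPy sketch of that two-pathway idea, using per-feature linear maps in place of actual convolutions and function/variable names of our own invention (not the paper's implementation), might look like:

```python
import numpy as np

def temporal_difference_op(frames, w_static, w_motion):
    """Illustrative two-pathway operator (hypothetical, not the paper's TDC).

    frames: (T, C) array, a simplified per-location feature sequence.
    w_static, w_motion: (C, C) weight matrices standing in for convolutions.
    """
    T, C = frames.shape
    # Inter-frame differences; the first frame acts as the keyframe anchor
    # and contributes zero difference.
    diffs = np.vstack([np.zeros((1, C)), np.diff(frames, axis=0)])
    static_path = frames @ w_static   # pathway for static content
    motion_path = diffs @ w_motion    # pathway for sparse motion residuals
    return static_path + motion_path
```

For a perfectly static clip the difference pathway contributes nothing, so the output reduces to the static pathway alone; any actual motion is carried entirely by the (typically sparse) residual term, which is the separation the abstract motivates.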
Primary Area: generative models
Submission Number: 162