Abstract: The Variational Autoencoder (VAE), which compresses videos into latent representations, is a crucial preceding component of Latent Video Diffusion Models (LVDMs). However, most LVDMs use a 2D image VAE, which compresses videos only spatially, not temporally. This leads to temporally redundant representations and reduces the efficiency of LVDMs. To address this issue, we propose an omni-dimension compression VAE, named OD-VAE, which compresses videos both temporally and spatially based on a 3D-Causal-CNN architecture. To obtain a better trade-off between video reconstruction quality and compression speed, we further introduce and analyze four model variants of OD-VAE. In addition, we design a novel initialization method to train OD-VAE more efficiently and propose a novel inference strategy that enables OD-VAE to handle videos of arbitrary length with limited GPU memory. Comprehensive experiments on video reconstruction and LVDM-based video generation demonstrate the effectiveness and efficiency of our proposed methods. The source code and models are available here.
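To illustrate what "causal" means for the temporal axis of a 3D-Causal-CNN, the following is a minimal sketch of a causal convolution along time only, written in plain Python on scalar "frames". The function name `causal_temporal_conv`, the replicate-first-frame padding, and the example kernel are assumptions made for illustration; they are not taken from the OD-VAE implementation.

```python
def causal_temporal_conv(frames, kernel):
    """Convolve along time so that output t depends only on frames <= t.

    Hypothetical helper for illustration; not the OD-VAE code.
    """
    k = len(kernel)
    # Pad k - 1 copies of the first frame at the front (assumed padding choice),
    # so no output position ever reads a future frame.
    padded = [frames[0]] * (k - 1) + list(frames)
    return [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(frames))
    ]


frames = [1.0, 2.0, 3.0, 4.0]
kernel = [0.25, 0.25, 0.5]  # the last weight applies to the current frame
out = causal_temporal_conv(frames, kernel)

# Changing only the last frame leaves all earlier outputs unchanged.
out_changed = causal_temporal_conv([1.0, 2.0, 3.0, 100.0], kernel)
print(out[:3] == out_changed[:3])
```

Because all padding sits before the first frame, the output has the same length as the input, and each output can be computed without looking ahead in the sequence.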
External IDs: dblp:conf/icmcs/ChenLLZWYZCY25