The Variational Autoencoder (VAE) plays an indispensable role in the growing prominence of Latent Video Diffusion Models (LVDMs). Nevertheless, current latent generative models are generally built upon image VAEs, which compress only the spatial dimension, whereas a video VAE must also model temporal dynamics to produce smooth, high-quality video reconstructions. To address this issue, we propose UniVAE, which compresses videos both spatially and temporally while ensuring coherent video reconstruction. Specifically, we employ 3D convolutions at varying scales in the encoder to temporally compress videos, enabling UniVAE to capture dependencies across multiple time scales. Furthermore, existing VAEs reconstruct videos only at a low resolution and frame rate, bounded by limited GPU memory, which makes the entire video generation pipeline fragmented and complicated. Thus, in conjunction with the new encoder, we explore the potential of the VAE decoder to perform frame interpolation, synthesizing additional intermediate frames without relying on standalone interpolation models. Compared with existing VAEs, the proposed UniVAE offers a unified way to compress videos both spatially and temporally with a jointly designed encoder and decoder, thus achieving accurate and smooth video reconstruction at a high frame rate. Extensive experiments on commonly used public datasets for video reconstruction and generation demonstrate the superiority of the proposed UniVAE. The code and pre-trained models will be released to facilitate further research.
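The abstract does not give implementation details, so the following is only a minimal PyTorch sketch of the general mechanism it describes: strided 3D convolutions compressing the temporal axis in the encoder, and a transposed 3D convolution in the decoder emitting more frames than the latent holds, which is one plausible way a decoder can interpolate intermediate frames. All module names, channel counts, kernel sizes, and strides here are illustrative assumptions, not the paper's actual UniVAE configuration.

```python
# Minimal sketch (not the authors' code): strided 3D convolutions compress
# time and space in the encoder; a transposed 3D convolution in the decoder
# doubles the temporal resolution so the decoder can emit intermediate frames.
# All names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class TemporalEncoderBlock(nn.Module):
    """Downsamples time (and space) with a strided 3D convolution."""
    def __init__(self, in_ch, out_ch, t_stride=2, s_stride=2):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3,
                              stride=(t_stride, s_stride, s_stride),
                              padding=1)
        self.act = nn.SiLU()

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.act(self.conv(x))

class InterpolatingDecoderBlock(nn.Module):
    """Upsamples time by 2x, letting the decoder synthesize extra frames."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.deconv = nn.ConvTranspose3d(in_ch, out_ch,
                                         kernel_size=(4, 3, 3),
                                         stride=(2, 1, 1),
                                         padding=(1, 1, 1))
        self.act = nn.SiLU()

    def forward(self, z):  # z: (B, C, T', H', W')
        return self.act(self.deconv(z))

# Shape check: 16 input frames -> 4 latent frames -> 8 decoded frames;
# stacking further decoder blocks would reach or exceed the input frame rate.
x = torch.randn(1, 3, 16, 64, 64)
enc = nn.Sequential(TemporalEncoderBlock(3, 32), TemporalEncoderBlock(32, 64))
z = enc(x)                          # (1, 64, 4, 16, 16)
dec = InterpolatingDecoderBlock(64, 32)
y = dec(z)                          # (1, 32, 8, 16, 16)
print(z.shape, y.shape)
```

Stacking two such encoder blocks gives 4x temporal compression; each decoder block then doubles the frame count again, so a decoder with one more temporal upsampling stage than the encoder has downsampling stages would output more frames than it was given, without a standalone interpolation model.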