The Variational Autoencoder (VAE) plays an indispensable role in the growing prominence of Latent Video Diffusion Models (LVDMs). Nevertheless, current latent generative models are generally built upon image VAEs, which compress only the spatial dimension, whereas a video VAE must also model temporal dynamics to produce smooth, high-quality video reconstructions. To address this issue, we propose UniVAE, which compresses videos both spatially and temporally while ensuring coherent video reconstruction. Specifically, we employ 3D convolutions at varying scales in the encoder to temporally compress videos, enabling UniVAE to capture dependencies across multiple time scales. Furthermore, existing VAEs reconstruct videos only at a low resolution and frame rate, bounded by limited GPU memory, which makes the entire video generation pipeline fragmented and complicated. Thus, in conjunction with the new encoder, we explore the potential of the VAE decoder to perform frame interpolation, synthesizing additional intermediate frames without relying on standalone interpolation models. Compared with existing VAEs, the proposed UniVAE offers a unified way to compress videos both spatially and temporally with a jointly designed encoder and decoder, thus achieving accurate and smooth video reconstruction at a high frame rate. Extensive experiments on commonly used public datasets for video reconstruction and generation demonstrate the superiority of the proposed UniVAE. The code and pre-trained models will be released to facilitate further research.
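The abstract does not give implementation details, so the following is only a minimal PyTorch sketch of the general mechanism it describes: strided 3D convolutions compressing the temporal axis in the encoder, and a transposed 3D convolution in the decoder emitting more frames than the latent holds, which is one plausible way a decoder can interpolate intermediate frames. All module names, channel counts, kernel sizes, and strides here are illustrative assumptions, not the paper's actual UniVAE configuration.

```python
# Minimal sketch (not the authors' code): strided 3D convolutions compress
# time and space in the encoder; a transposed 3D convolution in the decoder
# doubles the temporal resolution so the decoder can emit intermediate frames.
# All names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class TemporalEncoderBlock(nn.Module):
    """Downsamples time (and space) with a strided 3D convolution."""
    def __init__(self, in_ch, out_ch, t_stride=2, s_stride=2):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3,
                              stride=(t_stride, s_stride, s_stride),
                              padding=1)
        self.act = nn.SiLU()

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.act(self.conv(x))

class InterpolatingDecoderBlock(nn.Module):
    """Upsamples time by 2x, letting the decoder synthesize extra frames."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.deconv = nn.ConvTranspose3d(in_ch, out_ch,
                                         kernel_size=(4, 3, 3),
                                         stride=(2, 1, 1),
                                         padding=(1, 1, 1))
        self.act = nn.SiLU()

    def forward(self, z):  # z: (B, C, T', H', W')
        return self.act(self.deconv(z))

# Shape check: 16 input frames -> 4 latent frames -> 8 decoded frames;
# stacking further decoder blocks would reach or exceed the input frame rate.
x = torch.randn(1, 3, 16, 64, 64)
enc = nn.Sequential(TemporalEncoderBlock(3, 32), TemporalEncoderBlock(32, 64))
z = enc(x)                          # (1, 64, 4, 16, 16)
dec = InterpolatingDecoderBlock(64, 32)
y = dec(z)                          # (1, 32, 8, 16, 16)
print(z.shape, y.shape)
```

Stacking two such encoder blocks gives 4x temporal compression; each decoder block then doubles the frame count again, so a decoder with one more temporal upsampling stage than the encoder has downsampling stages would output more frames than it was given, without a standalone interpolation model.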