Perceptual Neural Video Compression with Video Variational AutoEncoder at Low Bitrates

16 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: video compression, image compression, neural network, end-to-end optimization
TL;DR: Learned Video Compression
Abstract: Existing neural video compression methods typically rely on frame-wise coding frameworks: motion estimation and compensation eliminate inter-frame redundancy, and compression performance is further enhanced through explicit residual or implicit conditional coding. However, these methods are primarily optimized for distortion, leading to significant degradation in perceptual quality at low bitrates. In this paper, we propose a novel learning-based video compression framework that leverages the compression and generative capabilities of video variational autoencoders. Unlike traditional frame-wise processing, our method operates on groups of frames, effectively improving perceptual quality at low bitrates. Specifically, we use a video variational autoencoder to eliminate both temporal and spatial redundancy, encoding video clips into a perception-oriented latent space. Transform coding is then applied to further capture spatial dependencies, yielding a more compressible latent representation. Finally, entropy coding compresses the quantized latent representation of each group of frames. Since each group of pictures is coded independently, our method naturally supports parallel processing for acceleration. To incorporate information from adjacent frame groups and maintain temporal consistency across them, we introduce an overlapping processing strategy that ensures smooth transitions between adjacent groups. Extensive experiments on the HEVC, UVG, and MCL-JCV benchmark datasets demonstrate that our framework outperforms existing methods on perceptual metrics.
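The overlapping processing strategy described in the abstract can be sketched as follows. This is a toy numpy illustration, not the authors' implementation: the VAE, transform coding, and entropy coding stages are replaced by an identity placeholder, and the group size, overlap, and linear cross-fade weights are illustrative assumptions.

```python
import numpy as np

def split_into_groups(video, group_size=8, overlap=2):
    """Split a (T, H, W) video into overlapping groups of frames.

    Assumes (group_size - overlap) evenly tiles the sequence, i.e. the
    last group ends exactly at the final frame.
    """
    stride = group_size - overlap
    return [video[t:t + group_size]
            for t in range(0, video.shape[0] - group_size + 1, stride)]

def merge_groups(groups, total_frames, group_size=8, overlap=2):
    """Merge decoded groups, linearly cross-fading the overlapping frames."""
    out = np.zeros((total_frames,) + groups[0].shape[1:])
    acc = np.zeros(total_frames)
    # Linear ramps over the overlap region give a smooth cross-fade
    # between adjacent groups; interior frames keep full weight.
    w = np.ones(group_size)
    ramp = np.linspace(1, overlap, overlap) / (overlap + 1)
    w[:overlap] = ramp
    w[group_size - overlap:] = ramp[::-1]
    stride = group_size - overlap
    for i, g in enumerate(groups):
        t = i * stride
        out[t:t + group_size] += g * w[:, None, None]
        acc[t:t + group_size] += w
    # Normalize by accumulated weight so boundary frames are unaffected.
    return out / acc[:, None, None]

# With the codec replaced by identity, merging reconstructs the input.
video = np.arange(14, dtype=float).reshape(14, 1, 1)
groups = split_into_groups(video)        # 2 groups: frames 0-7 and 6-13
decoded = [g.copy() for g in groups]     # placeholder for VAE + entropy coding
merged = merge_groups(decoded, 14)
assert np.allclose(merged, video)
```

Because each group is coded independently, the per-group stage can run in parallel; only the final cross-fade couples adjacent groups.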
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6862