MoVie: Multimodal Video Compression with Text Guidance

16 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Video Compression
TL;DR: MoVie is a text-guided multimodal video codec that unifies video-centric Transformer–CNN blocks with history-conditioned coding, achieving large perceptual BD-rate gains (−50.23% FID/−28.27% LPIPS vs. HM) with ~44% less compute than DCVC-FM.
Abstract: Recent advances in deep video compression have significantly improved rate-distortion performance. Compared to traditional codecs that rely on handcrafted motion estimation and block-based prediction, deep learning-based methods can learn more flexible and content-adaptive representations, leading to better compression efficiency. However, most existing approaches still focus primarily on low-level pixel motion modeling and lack semantic awareness, which limits their ability to preserve perceptual quality in complex scenes. In this paper, we propose **MoVie**, a **M**ultim**o**dal **Vi**d**e**o compression framework built upon a Text-guided Video Transformer–CNN mixed block (Text-VideoTCM). Instead of relying on image-oriented feature extractors that ignore temporal cues, we design a video-focused network that jointly models local spatial structures and temporal dynamics, achieving a favorable trade-off between computational cost and perceptual performance. To enhance semantic perception, a dual-stage text fusion mechanism is introduced: Extractor modules distill text-aware features at early layers, while Injector modules inject refined semantics in deeper stages. We also introduce a new history-conditioned coding recipe that adaptively leverages both the previous frame and aggregated historical frames, alongside a spatial-channel factorized entropy model tailored for window-based Transformers, which jointly captures local spatial structures and inter-channel dependencies. Averaged over the UVG and MCL-JCV datasets, MoVie achieves substantial BD-rate reductions relative to HM: **-50.23%** for FID and **-14.64%** for LPIPS (VGGNet). While maintaining superior perceptual quality, our method substantially reduces computational cost, requiring only **55.76%** of the per-pixel kMACs of DCVC-FM.
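
To make the dual-stage text fusion concrete, below is a minimal PyTorch sketch of how an early-layer Extractor and a deep-layer Injector could condition video features on text embeddings. The module names, the use of cross-attention for extraction, and the gated residual for injection are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn


class TextExtractor(nn.Module):
    """Hypothetical early-stage module: distills text-aware video features by
    cross-attending from video tokens (queries) to text tokens (keys/values)."""

    def __init__(self, dim, text_dim, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(
            dim, num_heads, kdim=text_dim, vdim=text_dim, batch_first=True
        )

    def forward(self, video_tokens, text_tokens):
        # video_tokens: (B, N, dim); text_tokens: (B, T, text_dim)
        fused, _ = self.attn(self.norm(video_tokens), text_tokens, text_tokens)
        return video_tokens + fused  # residual fusion keeps the codec path intact


class TextInjector(nn.Module):
    """Hypothetical deep-stage module: injects pooled text semantics back into
    the video features through a per-token learned gate."""

    def __init__(self, dim, text_dim):
        super().__init__()
        self.proj = nn.Linear(text_dim, dim)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, video_tokens, text_tokens):
        # Pool the text sequence into a single semantic context vector.
        ctx = self.proj(text_tokens.mean(dim=1, keepdim=True))  # (B, 1, dim)
        ctx = ctx.expand_as(video_tokens)
        g = self.gate(torch.cat([video_tokens, ctx], dim=-1))   # per-token gate in [0, 1]
        return video_tokens + g * ctx


if __name__ == "__main__":
    B, N, T, dim, text_dim = 2, 196, 77, 128, 512
    video = torch.randn(B, N, dim)       # flattened spatio-temporal tokens
    text = torch.randn(B, T, text_dim)   # e.g. frozen text-encoder embeddings
    video = TextExtractor(dim, text_dim)(video, text)  # early-layer distillation
    video = TextInjector(dim, text_dim)(video, text)   # deep-layer injection
    print(video.shape)  # torch.Size([2, 196, 128])
```

In this sketch the two stages are complementary: the Extractor lets every video token attend to the full text sequence early on, while the Injector only modulates deep features with a pooled summary, which keeps the late-stage cost low.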
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6670