Keywords: Implicit Neural Representations, Neural Representations for Videos.
Abstract: Implicit Neural Representations (INRs) have emerged as a compelling paradigm, with Neural Representations for Videos (NeRV) achieving remarkable compression ratios by encoding videos as neural network parameters. However, existing NeRV-based approaches face fundamental scalability limitations: per-video optimization via iterative gradient descent is computationally expensive, and convolutional architectures with shared kernel parameters provide weak pixel-level control and limit the global dependency modeling essential for high-fidelity reconstruction. We introduce CAVINR, a pure transformer framework that departs fundamentally from convolutional approaches by leveraging persistent cross-attention mechanisms. CAVINR makes three contributions: a transformer encoder that compresses videos into compact video tokens encoding spatial textures and temporal dynamics; a coordinate-attentive decoder with persistent weights that cross-attends coordinate queries to video tokens; and temperature-modulated attention with block query processing, which improves reconstruction fidelity while reducing memory complexity. Comprehensive experiments demonstrate CAVINR's superior performance: 6-9 dB PSNR improvements over state-of-the-art methods, $10^5\times$ encoding acceleration compared to gradient-based optimization, $85-95\%$ memory reduction, and 7.5$\times$ faster convergence, with robust generalization across diverse video content that enables practical deployment in large-scale video processing applications.
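To make the decoder idea concrete, below is a minimal PyTorch sketch of cross-attention between coordinate queries and video tokens with a temperature-modulated softmax, plus block-wise query processing. This is an illustration under stated assumptions, not the paper's implementation: the class and function names (`CoordinateCrossAttention`, `decode_in_blocks`), the per-head learnable temperature `log_tau`, and the `block_size` parameter are all hypothetical.

```python
import torch
import torch.nn as nn

class CoordinateCrossAttention(nn.Module):
    """Hypothetical sketch: coordinate queries cross-attend to video tokens,
    with a learnable per-head temperature modulating the attention logits.
    The actual CAVINR layer design is not specified in the abstract."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q_proj = nn.Linear(dim, dim)       # projects coordinate embeddings
        self.kv_proj = nn.Linear(dim, 2 * dim)  # projects video tokens to K, V
        self.out_proj = nn.Linear(dim, dim)
        # assumed form of temperature modulation: one learnable scalar per head
        self.log_tau = nn.Parameter(torch.zeros(num_heads))

    def forward(self, coord_queries: torch.Tensor, video_tokens: torch.Tensor) -> torch.Tensor:
        # coord_queries: (B, Nq, dim) -- embeddings of (x, y, t) coordinates
        # video_tokens:  (B, Nk, dim) -- compact tokens from the encoder
        B, Nq, _ = coord_queries.shape
        Nk = video_tokens.shape[1]
        q = self.q_proj(coord_queries).view(B, Nq, self.num_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv_proj(video_tokens).chunk(2, dim=-1)
        k = k.view(B, Nk, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, Nk, self.num_heads, self.head_dim).transpose(1, 2)
        # temperature-modulated scaled dot-product attention
        tau = self.log_tau.exp().view(1, -1, 1, 1)
        attn = (q @ k.transpose(-2, -1)) / (self.head_dim ** 0.5 * tau)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, Nq, -1)
        return self.out_proj(out)

def decode_in_blocks(attn_layer, coord_queries, video_tokens, block_size=4096):
    # Block query processing (assumption): splitting coordinate queries into
    # chunks makes attention memory scale with block_size rather than with
    # the total number of queried pixels.
    outputs = [attn_layer(coord_queries[:, i:i + block_size], video_tokens)
               for i in range(0, coord_queries.shape[1], block_size)]
    return torch.cat(outputs, dim=1)
```

Because the query blocks attend to the same fixed set of video tokens independently, chunking changes peak memory but not the result, which is one plausible reading of the abstract's claimed memory reduction.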
Supplementary Material: pdf
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 7841