Keywords: Implicit Neural Representations, Neural Representations for Videos.
Abstract: Implicit Neural Representations (INRs) have emerged as a compelling paradigm, with Neural Representations for Videos (NeRV) achieving remarkable compression ratios by encoding videos as neural network parameters. However, existing NeRV-based approaches face fundamental scalability limitations: per-video optimization via iterative gradient descent is computationally expensive, and convolutional architectures with shared kernel parameters provide weak pixel-level control and limit the global dependency modeling essential for high-fidelity reconstruction. We introduce CAVINR, a pure transformer framework that departs fundamentally from convolutional approaches by leveraging persistent cross-attention mechanisms. CAVINR makes three contributions: a transformer encoder that compresses videos into compact video tokens encoding spatial textures and temporal dynamics; a coordinate-attentive decoder with persistent weights that cross-attends coordinate queries to video tokens; and temperature-modulated attention with block query processing, which improves reconstruction fidelity while reducing memory complexity. Comprehensive experiments demonstrate CAVINR's superior performance: 6-9 dB PSNR improvements over state-of-the-art methods, $10^5\times$ encoding acceleration compared to gradient-based optimization, $85-95\%$ memory reduction, and 7.5$\times$ faster convergence, with robust generalization across diverse video content that enables practical deployment in large-scale video processing applications.
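To make the decoder idea concrete, below is a minimal PyTorch sketch of cross-attention between coordinate queries and video tokens with a temperature-modulated softmax, plus block-wise query processing. This is an illustration under stated assumptions, not the paper's implementation: the class and function names (`CoordinateCrossAttention`, `decode_in_blocks`), the per-head learnable temperature `log_tau`, and the `block_size` parameter are all hypothetical.

```python
import torch
import torch.nn as nn

class CoordinateCrossAttention(nn.Module):
    """Hypothetical sketch: coordinate queries cross-attend to video tokens,
    with a learnable per-head temperature modulating the attention logits.
    The actual CAVINR layer design is not specified in the abstract."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q_proj = nn.Linear(dim, dim)       # projects coordinate embeddings
        self.kv_proj = nn.Linear(dim, 2 * dim)  # projects video tokens to K, V
        self.out_proj = nn.Linear(dim, dim)
        # assumed form of temperature modulation: one learnable scalar per head
        self.log_tau = nn.Parameter(torch.zeros(num_heads))

    def forward(self, coord_queries: torch.Tensor, video_tokens: torch.Tensor) -> torch.Tensor:
        # coord_queries: (B, Nq, dim) -- embeddings of (x, y, t) coordinates
        # video_tokens:  (B, Nk, dim) -- compact tokens from the encoder
        B, Nq, _ = coord_queries.shape
        Nk = video_tokens.shape[1]
        q = self.q_proj(coord_queries).view(B, Nq, self.num_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv_proj(video_tokens).chunk(2, dim=-1)
        k = k.view(B, Nk, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, Nk, self.num_heads, self.head_dim).transpose(1, 2)
        # temperature-modulated scaled dot-product attention
        tau = self.log_tau.exp().view(1, -1, 1, 1)
        attn = (q @ k.transpose(-2, -1)) / (self.head_dim ** 0.5 * tau)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, Nq, -1)
        return self.out_proj(out)

def decode_in_blocks(attn_layer, coord_queries, video_tokens, block_size=4096):
    # Block query processing (assumption): splitting coordinate queries into
    # chunks makes attention memory scale with block_size rather than with
    # the total number of queried pixels.
    outputs = [attn_layer(coord_queries[:, i:i + block_size], video_tokens)
               for i in range(0, coord_queries.shape[1], block_size)]
    return torch.cat(outputs, dim=1)
```

Because the query blocks attend to the same fixed set of video tokens independently, chunking changes peak memory but not the result, which is one plausible reading of the abstract's claimed memory reduction.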
Supplementary Material: pdf
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 7841