Keywords: Efficient Video Understanding, Sparse Attention
TL;DR: VideoNSA shows that native sparse attention can scale video-language models reliably and efficiently.
Abstract: Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. **Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video-instruction dataset. We employ a hardware-aware hybrid attention scheme, preserving dense attention for text while applying NSA to video tokens.** Compared to token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global–local attention allocation under a fixed budget; (3) task-dependent branch-usage patterns; and (4) learnable combined sparse attention that helps induce dynamic attention sinks.
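To make the hybrid dense/sparse routing concrete, the sketch below illustrates one plausible reading of the idea: text queries attend densely over the causal prefix, while video queries attend to a local window plus a few top-scoring key blocks, a simplified stand-in for NSA's selection branch. This is not the authors' implementation; `hybrid_attention`, `is_video`, and all hyperparameters (`block`, `topk_blocks`, `window`) are illustrative assumptions.

```python
# Minimal sketch (not the VideoNSA code): hybrid attention routing, assuming a
# boolean `is_video` mask over the token sequence. Text queries attend densely;
# video queries attend to a local window plus top-k coarse key blocks.
import torch
import torch.nn.functional as F

def hybrid_attention(q, k, v, is_video, block=64, topk_blocks=4, window=128):
    """q, k, v: (seq, dim); is_video: (seq,) bool. Returns (seq, dim)."""
    seq, dim = q.shape
    scale = dim ** -0.5
    scores = (q @ k.T) * scale                      # full scores, for clarity only
    causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool, device=q.device))

    # Dense branch: text queries see every earlier token.
    dense_mask = causal.clone()

    # Sparse branch: score mean-pooled key blocks per query, keep the top-k blocks.
    n_blocks = (seq + block - 1) // block
    block_ids = torch.arange(seq, device=q.device) // block
    k_blocks = torch.zeros(n_blocks, dim, device=q.device).index_add_(0, block_ids, k)
    k_blocks /= torch.bincount(block_ids, minlength=n_blocks).unsqueeze(1).float()
    blk_scores = (q @ k_blocks.T) * scale           # (seq, n_blocks)
    keep = torch.zeros_like(blk_scores, dtype=torch.bool)
    keep.scatter_(1, blk_scores.topk(min(topk_blocks, n_blocks), dim=1).indices, True)
    sparse_mask = keep[:, block_ids]                # expand block choice to tokens
    idx = torch.arange(seq, device=q.device)
    local = (idx[:, None] - idx[None, :]).abs() < window
    sparse_mask = (sparse_mask | local) & causal

    # Route each query through the dense or sparse mask based on its modality.
    mask = torch.where(is_video[:, None], sparse_mask, dense_mask)
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Usage: 1,024 tokens, the last 768 marked as video.
q, k, v = (torch.randn(1024, 64) for _ in range(3))
is_video = torch.zeros(1024, dtype=torch.bool)
is_video[256:] = True
out = hybrid_attention(q, k, v, is_video)
print(out.shape)  # torch.Size([1024, 64])
```

In this toy version the full score matrix is still materialized; the hardware-aware point of NSA-style kernels is precisely to avoid that by computing only the selected blocks.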
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 640