video-SALMONN S: Streaming Audio-Visual LLMs Beyond Length Limits via Memory

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: large language model, audio-visual, long video, memory, streaming
TL;DR: An audio-visual LLM for streaming understanding of long videos
Abstract: Continuous, high-frame-rate, high-resolution processing of long video streams is critical for future AI agents, yet current video-understanding LLMs struggle to scale. Offline methods with a fixed frame budget must adapt the frame rate to the stream length, while streaming methods constrain memory by merging or discarding tokens and thereby lose information. We propose video-SALMONN S, a streaming audio–visual LLM that processes arbitrary-length videos at 360p resolution under a fixed memory budget. The framework introduces (i) a test-time-training (TTT) memory module that replaces token merging by continually updating token representations to capture long-range dependencies, and (ii) a prompt-dependent memory reader that selectively retrieves context-relevant content from the fixed-size memory. The TTT module is optimised with a Hessian-free conjugate-gradient procedure (TTT-HF) for efficient adaptation. On long-video benchmarks (Video-MME, LVBench, VideoEvalPro), video-SALMONN S sustains high-quality understanding of multi-hour videos with >10k frames and ~1M tokens. Our 8B-parameter model achieves 74.2% overall on Video-MME and 67.8% on its long split, significantly outperforming both offline and streaming baselines.
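To make the abstract's two components concrete, below is a minimal, hypothetical PyTorch sketch of a fixed-budget memory that is (i) updated by a test-time-training write step and (ii) read in a prompt-dependent way. All names (`TTTMemory`, `write`, `read`), the slot-attention reconstruction objective, and the plain gradient-descent inner loop are illustrative assumptions; the paper's actual Hessian-free conjugate-gradient update (TTT-HF) and its memory parameterisation are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TTTMemory(nn.Module):
    """Illustrative fixed-size memory with test-time-training writes and
    prompt-dependent reads (names and objective are assumptions, not the
    paper's implementation)."""

    def __init__(self, dim: int, memory_slots: int, inner_lr: float = 0.1):
        super().__init__()
        # Fixed memory budget: `memory_slots` slot vectors of width `dim`.
        self.slots = nn.Parameter(torch.randn(memory_slots, dim) * 0.02)
        self.inner_lr = inner_lr

    def write(self, chunk_tokens: torch.Tensor, steps: int = 1) -> None:
        """Adapt the memory slots so they can reconstruct the latest chunk.

        chunk_tokens: (num_tokens, dim) audio-visual tokens for one chunk.
        A plain gradient step stands in for the TTT-HF update described
        in the paper.
        """
        slots = self.slots.detach().clone().requires_grad_(True)
        for _ in range(steps):
            # Attention from incoming tokens to memory slots, then reconstruct.
            attn = F.softmax(
                chunk_tokens @ slots.t() / slots.shape[-1] ** 0.5, dim=-1
            )
            recon = attn @ slots
            loss = F.mse_loss(recon, chunk_tokens)
            (grad,) = torch.autograd.grad(loss, slots)
            slots = (slots - self.inner_lr * grad).detach().requires_grad_(True)
        with torch.no_grad():
            self.slots.copy_(slots)

    def read(self, prompt_emb: torch.Tensor, top_k: int = 8) -> torch.Tensor:
        """Prompt-dependent read: return the slots most relevant to the prompt.

        prompt_emb: (dim,) pooled embedding of the user prompt.
        """
        scores = self.slots @ prompt_emb
        idx = scores.topk(min(top_k, self.slots.shape[0])).indices
        return self.slots[idx]


# Usage sketch: stream chunks into the memory, then read with a prompt embedding.
mem = TTTMemory(dim=64, memory_slots=32)
for _ in range(5):                       # five incoming video chunks
    mem.write(torch.randn(128, 64))      # 128 tokens per chunk
context = mem.read(torch.randn(64))      # (top_k, 64) retrieved memory
print(context.shape)
```

The key property the sketch shares with the abstract's description is that memory size stays constant regardless of stream length: information from new chunks is absorbed by updating the memory's parameters rather than by appending, merging, or discarding tokens.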
Primary Area: foundation or frontier models, including LLMs
Submission Number: 14504