Do State-of-the-art Audio-visual VLMs Understand Audio-video Temporal Misalignment?

Published: 07 Aug 2025, Last Modified: 18 Aug 2025 · Gen4AVC Poster · CC BY 4.0
Keywords: video and language; audio recognition; evaluation
Abstract: Audio-visual vision–language models (VLMs) have recently leapt forward, excelling at recognizing and localizing audio-visual events in videos and at generating videos with audio from text prompts. Yet whether these models truly understand the temporal synchrony between what is seen and what is heard remains an open question. Existing systems either (i) sparsely sample video frames, making accurate alignment challenging, or (ii) inherit M-RoPE/TM-RoPE positional encodings that are reliable only within a two-second window; moreover, all are trained and evaluated exclusively on perfectly aligned audio–video pairs. Understanding misalignment is critical: safety-critical applications require millisecond-level localization of events, and temporal desynchronization is an emerging attack surface. We introduce a compact evaluation set that injects controlled audio–video time shifts into real-world clips and use it to test two leading audio-visual VLMs, Gemini 2.0 Flash and Qwen-2.5 Omni. Both models scored below chance at recognizing misalignment between audio and visual information, exposing a clear gap in current audio-visual understanding and motivating alignment-aware model development.
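The abstract's core perturbation, injecting a controlled audio–video time shift, can be sketched as follows. This is an illustrative reconstruction, not the paper's actual pipeline; the function name `shift_audio` and the 16 kHz sample rate are assumptions.

```python
SAMPLE_RATE = 16_000  # audio samples per second (assumed, not from the paper)

def shift_audio(samples: list[float], offset_ms: int) -> list[float]:
    """Shift the audio track relative to the video by offset_ms.

    Positive offsets delay the audio (silence is padded at the start,
    trailing samples are dropped); negative offsets advance it (leading
    samples are dropped, silence is padded at the end). The output has
    the same length as the input, so the video timeline is untouched.
    """
    n = len(samples)
    k = abs(offset_ms) * SAMPLE_RATE // 1000  # offset converted to samples
    if offset_ms >= 0:
        return [0.0] * min(k, n) + samples[: max(n - k, 0)]
    return samples[min(k, n):] + [0.0] * min(k, n)
```

For example, a +500 ms shift on one second of audio replaces the first 8,000 samples with silence while keeping the clip length fixed, so the same video frames now pair with earlier audio content.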
Submission Number: 10