VLMs Hate Ads Too: Evaluating Robustness to Ad Interruptions in Videos

ACL ARR 2026 January Submission2577 Authors

03 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · License: CC BY 4.0
Keywords: Vision-Language Models, Video Understanding, Robustness, Video Question Answering, Position Bias
Abstract: Current video benchmarks mostly rely on clean footage, overlooking the ubiquity of advertisements. To bridge this gap, we introduce Ads-VideoMMMU, a benchmark evaluating VLM robustness against realistic ad interruptions. Our experiments reveal that ads serve as semantic distractors that systematically impair perception and comprehension, causing state-of-the-art models like GPT-4o to suffer accuracy drops of up to 9.3\%. Notably, we identify a "Prefix Penalty": ads at the beginning of a video cause more damage than ads in other positions. We find two entangled causes for this: models find it much harder to distinguish initial ads from the main content,and a "Visual Primacy Effect" where models over-prioritize early visual inputs. Furthermore, we characterize "last-mile errors" as a common failure mode under ad interference and propose a lightweight two-agent framework that effectively mitigates these failures.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering,video processing,multimodality,robustness,adversarial attacks/examples/training
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 2577