Keywords: Robustness, Visual Noise Benchmark, Video understanding, Video-LLM
Abstract: Recent progress in Video Large Language Models (Video-LLMs) has greatly advanced multimodal understanding and reasoning in video analysis. However, the robustness of these models under the diverse noise conditions that commonly occur in real-world scenarios remains largely unexplored, and existing research lacks systematic evaluations of Video-LLM performance on question answering under such conditions; both gaps limit the models' reliability in practical deployments. To bridge this gap, we propose a comprehensive robustness benchmark encompassing 36 noise types in 8 categories, spanning diverse video categories and question types and yielding 21,924 noise-corrupted test videos for evaluating the robustness of Video-LLMs. We evaluate 10 state-of-the-art Video-LLMs on this benchmark, providing an initial systematic evaluation from multiple perspectives. Our multi-faceted analysis uncovers existing bottlenecks and performance degradation under certain noise scenarios, particularly for tasks requiring fine-grained understanding and reasoning. We also examine the effectiveness of current image restoration techniques in mitigating noise effects and discuss their limitations. By constructing this extensive benchmark, our work lays a foundation for the systematic evaluation of Video-LLMs and offers insightful findings for future research on robust video-language understanding.
Primary Area: datasets and benchmarks
Submission Number: 19132