How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs

05 Apr 2024 (modified: 13 Nov 2024) · Submitted to the NeurIPS 2024 Datasets and Benchmarks Track · License: CC BY-NC-SA 4.0
Keywords: Multi-modal Large Language Models, Video Understanding, Video Question Answering
TL;DR: A video QA benchmark to assess the reasoning and robustness of Video-LMMs across 11 real-world-centric complex video dimensions
Abstract: Recent advancements in Large Language Models (LLMs) have led to the development of Video Large Multi-modal Models (Video-LMMs) that can handle a wide range of video understanding tasks. These models have the potential to be deployed in real-world applications such as robotics, AI assistants, medical surgery, and autonomous vehicles. The widespread adoption of Video-LMMs in our daily lives underscores the importance of ensuring and evaluating their robust performance in mirroring human-like reasoning and interaction capabilities in complex, real-world contexts. However, existing benchmarks for Video-LMMs primarily focus on general video comprehension abilities and neglect to assess their reasoning over complex, real-world videos and their robustness to user prompts posed as text queries. In this paper, we present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES), a novel benchmark that comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions. We evaluate 11 recent models, including both open-source and closed-source variants, and find that most Video-LMMs, especially open-source ones, struggle with robustness and reasoning when dealing with complex videos. Based on our analysis, we develop a training-free Dual-Step Contextual Prompting (DSCP) technique that effectively enhances the performance of existing Video-LMMs on the CVRR-ES benchmark. Our findings provide valuable insights for building the next generation of human-centric AI systems with advanced robustness and reasoning capabilities. Our dataset and code are publicly available at: https://mbzuai-oryx.github.io/CVRR-Evaluation-Suite/.
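The abstract names DSCP but does not spell out its mechanics. The snippet below is only a minimal, hypothetical sketch of what a training-free, two-step contextual prompting wrapper around a generic Video-LMM could look like; the `video_lmm(video, prompt)` interface, the prompt wording, and the exact two-step decomposition are illustrative assumptions, not the authors' actual method.

```python
# Hypothetical sketch of a training-free, two-step contextual prompting wrapper.
# The callable `video_lmm(video, prompt) -> str` and the prompt texts are
# assumptions for illustration; they are not the paper's exact DSCP prompts.

def dual_step_prompting(video_lmm, video, question: str) -> str:
    # Step 1: elicit context by asking the model to describe the video and
    # flag anything unusual or ambiguous before it sees the user's question.
    step1_prompt = (
        "Carefully watch the video and describe its key events, actors, and "
        "context. Note anything unusual, ambiguous, or counter-intuitive."
    )
    context = video_lmm(video, step1_prompt)

    # Step 2: answer the user's question conditioned on the elicited context,
    # instructing the model not to confirm false premises in the question.
    step2_prompt = (
        f"Context from the video: {context}\n"
        f"Question: {question}\n"
        "Answer based only on what is actually shown; if the question contains "
        "an incorrect assumption, point it out instead of agreeing with it."
    )
    return video_lmm(video, step2_prompt)
```

Because the wrapper only changes the prompts sent to the model, it requires no fine-tuning and can, in principle, be applied to any existing Video-LMM that accepts a video plus a text query.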
Supplementary Material: zip
Submission Number: 20