Benchmarking Visual Knowledge in Multimodal Large Language Models

03 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Multimodal Large Language Model, Video Large Language Model, Visual Knowledge, Benchmark, Datasets
Abstract: While Multimodal Large Language Models (MLLMs) have become adept at recognizing objects, they often lack the intuitive, human-like understanding of the world's underlying physical and social principles. This capability, which we term visual knowledge, forms a bridge between perception and reasoning, yet remains underexplored in current systems. To systematically measure this capability, we present VKBench, a comprehensive video benchmark featuring 1,680 questions across 1,249 videos, covering eight core types of visual knowledge spanning both world-centric (e.g., intuitive physics) and human-centric (e.g., subjective intentions) categories. Results show that leading models still fall short of human performance, with particularly notable gaps in world-centric visual knowledge. To bridge this gap, we introduce VKQA, a new dataset, and Video-VK+, a baseline model that explicitly incorporates visual knowledge into MLLMs. Video-VK+ follows a structured See–Think–Answer format and is trained with reinforcement learning using a visual knowledge reward. This approach improves performance on VKBench by 3.7% and surpasses existing models on multiple video benchmarks. Our findings highlight visual knowledge as a key component for developing more robust and generalizable MLLMs that can not only see but also truly understand our world.
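To illustrate the kind of reward the abstract describes, below is a minimal sketch of a reward function that checks the structured See–Think–Answer format and answer correctness. The tag names, weights, and scoring rules are illustrative assumptions, not the authors' implementation of Video-VK+.

```python
import re

# Hypothetical reward in the spirit of "See-Think-Answer" training with a
# visual knowledge reward. Tags, weights, and matching rules are assumptions.
SEE_THINK_ANSWER = re.compile(
    r"<see>(?P<see>.*?)</see>\s*<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>",
    re.DOTALL,
)

def visual_knowledge_reward(response: str, gold_answer: str,
                            format_weight: float = 0.2,
                            answer_weight: float = 0.8) -> float:
    """Score a response: partial credit for structure, full credit for a correct answer."""
    match = SEE_THINK_ANSWER.search(response)
    if match is None:
        return 0.0  # no reward without the structured See-Think-Answer output
    predicted = match.group("answer").strip().lower()
    answer_score = 1.0 if predicted == gold_answer.strip().lower() else 0.0
    return format_weight + answer_weight * answer_score

if __name__ == "__main__":
    resp = ("<see>A ball rolls off a table.</see>"
            "<think>Gravity should pull it into a parabolic fall.</think>"
            "<answer>B</answer>")
    print(visual_knowledge_reward(resp, "B"))  # 1.0
```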
Primary Area: datasets and benchmarks
Submission Number: 1472