Do Vision Language Models infer human intention without visual perspective-taking? Towards a scalable "One-Image-Probe-All" dataset

12 May 2025 (modified: 30 Oct 2025) · Submitted to NeurIPS 2025 Datasets and Benchmarks Track · CC BY 4.0
Keywords: Multi-Modal Large Language Model, Scalable Benchmark, Theory-of-Mind (ToM), Knowledge Grounding
Abstract: At the core of understanding the knowledge grounding of Multimodal Large Language Models (MLLMs) are two key challenges: (1) ensuring fair comparability across concepts and (2) scaling multimodal datasets to reflect real-world complexity. This paper presents a solution through the Omni-Perspective benchmark, which scales the construction of 5-level question-context-answer (QCA) sets from a single real-world image. The benchmark covers 3 concepts along the human Theory-of-Mind (ToM) ability hierarchy and is further divided into 10 fine-grained sub-difficulties. Through inference tasks, complexity analysis, and ablation analysis, we evaluate 61 MLLMs on over 2,200 consolidated QCAs. Our findings reveal a key observation: MLLMs largely follow the human ToM grounding pathway, with the exception of level-2 perspective-taking. Furthermore, the dataset enables nuanced analysis of how this observation varies across difficulty levels, modalities, distractor logic, and prompt types.
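
To make the abstract's dataset structure concrete, below is a minimal Python sketch of what a single QCA record and a simple multiple-choice accuracy computation could look like. This is an illustration only: the field names (image_id, concept, level, sub_difficulty, choices, answer_index) and the multiple-choice format are assumptions inferred from the abstract, not the authors' released schema.

# Hypothetical sketch of an Omni-Perspective QCA record; all field names are
# assumptions based on the abstract (5 levels, 3 ToM concepts, 10 sub-difficulties).
from dataclasses import dataclass, field

@dataclass
class QCARecord:
    image_id: str                      # the single real-world source image
    concept: str                       # one of the 3 ToM concepts (label is hypothetical)
    level: int                         # difficulty level, 1-5
    sub_difficulty: int                # one of 10 fine-grained sub-difficulties
    question: str
    context: str
    choices: list[str] = field(default_factory=list)  # correct answer plus distractors
    answer_index: int = 0              # index of the correct choice

def accuracy(records: list[QCARecord], predictions: list[int]) -> float:
    """Exact-match accuracy over predicted choice indices."""
    if not records:
        return 0.0
    correct = sum(pred == rec.answer_index for rec, pred in zip(records, predictions))
    return correct / len(records)

if __name__ == "__main__":
    demo = QCARecord(
        image_id="img_0001",
        concept="perspective_taking",
        level=2,
        sub_difficulty=4,
        question="What does the person on the left believe about the cup?",
        context="Two people stand at a table; a cup is hidden behind a box.",
        choices=["It is behind the box", "It is gone", "It is on the floor"],
        answer_index=0,
    )
    print(accuracy([demo], [0]))  # -> 1.0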
Croissant File: json
Dataset URL: https://kaggle.com/datasets/24551424775233982841cd87686837acec4947cc56bdc35f9457477da419fd56
Primary Area: Datasets & Benchmarks for applications in computer vision
Submission Number: 2355
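
Since the submission provides a Croissant (JSON-LD) metadata file alongside the Kaggle dataset URL above, the sketch below shows one way such a file could be loaded with the open-source mlcroissant library. The local path "croissant.json" and the record-set name "default" are placeholder assumptions, not values taken from the submission.

# Minimal loading sketch, assuming `pip install mlcroissant` and that the
# submission's Croissant JSON has been downloaded locally as croissant.json.
import mlcroissant as mlc

# Parse the Croissant metadata file (path is an assumption).
ds = mlc.Dataset(jsonld="croissant.json")

# Inspect dataset-level metadata declared in the Croissant file.
print(ds.metadata.name)

# Iterate over a few records from an assumed record set named "default".
for i, record in enumerate(ds.records(record_set="default")):
    print(record)
    if i >= 2:
        break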