MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness

Published: 18 Sept 2025, Last Modified: 30 Oct 2025
Venue: NeurIPS 2025 Datasets and Benchmarks Track (poster)
License: CC BY 4.0
Keywords: Perspective Understanding, MLLM Evaluation Benchmark, MLLM Spatial Reasoning, Multimodal LLMs
Abstract: Understanding perspective is fundamental to human visual perception, yet the extent to which multimodal large language models (MLLMs) internalize perspective geometry remains unclear. We introduce MMPerspective, the first benchmark specifically designed to systematically evaluate MLLMs' understanding of perspective through 10 carefully crafted tasks across three complementary dimensions: Perspective Perception, Reasoning, and Robustness. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities such as vanishing point perception and counting, perspective type reasoning, line relationship understanding in 3D space, and invariance to perspective-preserving transformations. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations: while models demonstrate competence on surface-level perceptual tasks, they struggle with compositional reasoning and maintaining spatial consistency under perturbations. Our analysis further reveals intriguing relationships among model architecture, model scale, and perspective capabilities, highlighting both robustness bottlenecks and the benefits of chain-of-thought prompting. MMPerspective establishes a valuable testbed for diagnosing and advancing spatial understanding in vision-language systems. Resources are available at https://yunlong10.github.io/MMPerspective/
Croissant File: json
Dataset URL: https://kaggle.com/datasets/23790d67d7cd45fa602f32c30bbbdbbd8a80f881951f7a4a1380aee3c5dcedca
Code URL: https://github.com/yunlong10/MMPerspective
Supplementary Material: pdf
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 608