Right Side Up? Disentangling Orientation Understanding in MLLMs with Fine-grained Multi-axis Perception Tasks
Keywords: Orientation Understanding, 3D Scene Understanding, MLLM Probing, Benchmark Dataset, Computer Vision
Abstract: Object orientation understanding represents a fundamental challenge in visual perception that underpins critical real-world applications like robotic manipulation and augmented reality. However, current vision-language benchmarks fail to isolate and evaluate this core capability, often conflating it with positional relationships (such as above/below or proximity between objects) and general scene understanding. To address this, we introduce \textbf{DORI} (\textbf{D}iscriminative \textbf{O}rientation \textbf{R}easoning \textbf{I}ntelligence), a comprehensive hierarchical benchmark that establishes object orientation perception as a primary evaluation target. DORI rigorously assesses four essential dimensions of object orientation comprehension: frontal alignment, rotational transformations, relative directional relationships, and canonical orientation understanding. Through carefully curated tasks drawn from 14 sources that span $67$ object categories across synthetic and real-world scenarios, DORI provides valuable insights into how existing multimodal systems process and understand object orientations. Our evaluation of $18$ state-of-the-art vision-language models using DORI reveals critical limitations: even the best models achieve only $54.2\%$ accuracy on coarse tasks and $33.0\%$ on granular orientation judgments, with performance deteriorating substantially for tasks requiring reference frame shifts or compound rotations. These findings demonstrate the urgent need for dedicated orientation representation mechanisms in future architectures: models show a systematic inability to perform precise angular estimations, track orientation changes across multiple viewpoints, and understand compound rotations, which suggests fundamental limitations in their internal 3D spatial representations.
As the first diagnostic framework specifically designed for advancing orientation awareness in multimodal systems, DORI offers immediate implications for improving robotic control, 3D scene reconstruction, and human-AI interaction in physical environments.
Primary Area: datasets and benchmarks
Submission Number: 13396