Keywords: Multimodal Large Language Models (MLLMs), Pointing, Open Evaluation, Visual Grounding, Language-Guided Pointing, MLLM Arena
TL;DR: An open platform for evaluating multimodal large language models on visually grounded reasoning via pointing, across multiple evaluation stages
Abstract: Pointing serves as a fundamental and intuitive mechanism for grounding language within visual contexts, with applications spanning robotics, assistive technologies, and interactive AI systems. While recent multimodal models have begun supporting pointing capabilities, existing benchmarks typically focus only on referential object localization. We introduce PointArena, a comprehensive platform for evaluating multimodal pointing across diverse reasoning scenarios. PointArena comprises three components: (1) Point-Bench, a curated dataset of approximately 1,000 pointing tasks spanning five reasoning categories; (2) Point-Battle, an interactive web-based arena for blind, pairwise model comparisons, which has collected over 4,500 anonymized votes; and (3) Point-Act, a real-world robotic manipulation system that lets users directly evaluate a model's pointing capabilities in practical settings. We conducted extensive evaluations of both state-of-the-art open-source and proprietary models. Results indicate that Molmo-72B consistently outperforms other models, though proprietary models increasingly demonstrate comparable performance. We also find that supervised training targeted at pointing tasks significantly improves performance. Across our multi-stage evaluation pipeline, we observe strong correlations between stages, underscoring the critical role of precise pointing in enabling multimodal models to bridge abstract reasoning with real-world actions.
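For concreteness, below is a minimal sketch of how point-in-target scoring could work for a benchmark like Point-Bench, assuming each task supplies a binary ground-truth mask and the model returns a single (x, y) pixel coordinate. The function name, inputs, and usage are illustrative assumptions, not the paper's actual evaluation code.

```python
import numpy as np

def point_in_mask_accuracy(pred_points, gt_masks):
    """Score pointing predictions: a prediction counts as correct
    when its (x, y) point falls inside the task's target mask."""
    hits = 0
    for (x, y), mask in zip(pred_points, gt_masks):
        h, w = mask.shape
        xi, yi = int(round(x)), int(round(y))
        # Out-of-bounds points are counted as misses.
        if 0 <= xi < w and 0 <= yi < h and mask[yi, xi]:
            hits += 1
    return hits / len(pred_points)

# Hypothetical usage with one 4x4 mask and one predicted point:
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True  # target region
print(point_in_mask_accuracy([(1.2, 2.0)], [mask]))  # -> 1.0
```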
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 18166