Keywords: Multimodal Large Language Models, Geometric Perception, Benchmark
TL;DR: We propose a novel benchmark, GePBench, to evaluate the geometric perception abilities of Multimodal Large Language Models.
Abstract: Geometric shapes play important roles in both the physical world and human cognition. While multimodal large language models (MLLMs) have made significant advancements in visual understanding, their ability to recognize geometric shapes and their spatial relationships, which we term geometric perception, has not been explicitly and systematically explored. To address this gap, we introduce GePBench, a novel benchmark specifically designed to assess the geometric perception capabilities of MLLMs. Our extensive evaluations reveal that even current state-of-the-art MLLMs exhibit significant deficiencies in geometric perception tasks. Furthermore, we show that models trained with GePBench data demonstrate considerable improvements on a wide range of downstream tasks, highlighting the critical role of geometric perception in enabling advanced multimodal applications. Our code and datasets will be made publicly available.
Primary Area: datasets and benchmarks
Submission Number: 14798