Keywords: Benchmark, Multi-modal Large Language Model, Visual Reasoning, Real World Environments, Evaluation
Abstract: Multi-modal Large Language Models (MLLMs) represent a significant advancement in artificial intelligence. Among their growing capabilities, the ability to understand and reason about real-world environments stands out as a fundamental prerequisite for a wide array of real-world applications. Reasoning over complex, environment-scale spaces, such as rooms, buildings, and even urban areas, and predicting the future and planning actions are essential for humans and autonomous agents alike to operate in the real physical world, yet current methods for evaluating MLLMs fall short of comprehensively assessing these crucial capabilities. To address these gaps, we propose **SpaCE-Eval** (**Spa**tial Reasoning, **C**ommonsense Knowledge and **E**nvironment Interaction), a visual question answering benchmark designed to evaluate some of the most important reasoning abilities of MLLMs in real-world environments. As the name suggests, it challenges models to reason over complex spatial scenarios, invoke commonsense knowledge of the physical world, and interact with the environment. The dataset consists entirely of new diagrams purposefully created by humans, with diagram-question pairs meticulously refined and selected through a rigorous pipeline. Using the benchmark, we evaluate a selection of leading MLLMs, both proprietary and open-source. The results suggest that a significant enhancement of MLLMs' reasoning about the real physical world is necessary to realise more advanced general artificial intelligence.
Primary Area: datasets and benchmarks
Submission Number: 17260