Keywords: Vision-Language Models, Robotics Benchmark, Robot Manipulation
TL;DR: We propose a benchmark of multiple-choice questions to test VLMs' reasoning about low-level robot manipulation.
Abstract: Vision-Language Models (VLMs) have revolutionized artificial intelligence and robotics due to their commonsense reasoning capabilities. In robotic manipulation, VLMs are used primarily as high-level planners, but recent work has also studied their low-level reasoning ability, i.e., making decisions about precise robot movements. However, the community currently lacks a clear and common benchmark to evaluate how well VLMs can aid low-level reasoning in robotics. Consequently, we propose a novel benchmark, ManipBench, to evaluate the low-level robot manipulation reasoning capabilities of VLMs across various dimensions, including how well they understand object-object interactions and deformable object manipulation. We extensively test 35 common and state-of-the-art VLM families on our benchmark, including variants of different model sizes. Our evaluation shows that the performance of VLMs varies significantly across tasks and correlates strongly with trends in our real-world manipulation tasks. It also shows that a significant gap remains between these models and human-level understanding.
Supplementary Material: zip
Spotlight: mp4
Submission Number: 448