Keywords: Physical Tool Use, MLLM, Embodied AI
Abstract: Multimodal Large Language Models (MLLMs) excel at utilizing digital APIs and increasingly serve as the ``brain'' of embodied AI, instructing robots to interact with the physical world. In such embodied settings, a central capability is the use of physical tools, which underpins MLLMs' ability to assist humans in real-world tasks. Despite its importance, MLLMs' proficiency in physical tool use remains largely unexplored. To address this gap, we introduce \textit{PhysTool Bench}, the first physical tool-use benchmark designed to evaluate MLLMs' ability to comprehend real-world scenarios, identify physical tools, and plan their use. \textit{PhysTool Bench} comprises 2,510 queries over 2,678 real-world physical tools spanning diverse domains, including manufacturing, electrical work, agriculture, and healthcare. Models are evaluated along two primary dimensions: 1) recognizing all physical tools present in the scene, and 2) planning the tool selection and use sequence based on the instruction and visual context. Across 13 leading MLLMs, even the strongest model (Gemini-3.1-Pro) identifies only 58.7\% of the tools in a scene and completes only 21.0\% of queries end-to-end. Our analysis reveals a two-level deficit: MLLMs struggle to perceive tools in realistic scenes, and the much larger drop at the planning stage indicates a further lack of the functional commonsense needed to map perceived tools onto task semantics. Together, these findings pinpoint a critical bottleneck for the development of practical embodied AI.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, vision question answering
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
EMNLP 2026 AI Reviewing Experiment: yes
Submission Number: 13671