Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use

ACL ARR 2026 May Submission13671 Authors

26 May 2026 (modified: 02 Jun 2026), ACL ARR 2026 May Submission, CC BY 4.0

Keywords: Physical Tool Use, MLLM, Embodied AI

Abstract: Multimodal Large Language Models (MLLMs) excel at utilizing digital APIs and increasingly serve as the "brain" of embodied AI, instructing robots to interact with the physical world. In such embodied settings, a central capability is the use of physical tools, which underpins MLLMs' ability to assist humans in real-world tasks. Despite its importance, MLLMs' proficiency in physical tool use remains largely unexplored. To address this gap, we introduce PhysTool Bench, the first physical tool-use benchmark designed to evaluate MLLMs' ability to comprehend real-world scenarios, identify physical tools, and plan their use. PhysTool Bench comprises 2,510 queries over 2,678 real-world physical tools spanning diverse domains, including manufacturing, electrical work, agriculture, and healthcare. Concretely, models are evaluated along two primary dimensions: 1) recognizing all physical tools present in the scene, and 2) planning the tool selection and use sequence based on the instruction and visual context. Across 13 leading MLLMs, even the strongest model (Gemini-3.1-Pro) identifies only 58.7% of tools in a scene and completes merely 21.0% of queries end-to-end. Our analysis reveals a two-level deficit: MLLMs struggle to perceive tools in realistic scenes, and the much larger drop at the planning stage further indicates a lack of functional commonsense for mapping perceived tools onto task semantics, pinpointing a critical bottleneck for the development of practical embodied AI.
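The two evaluation dimensions named in the abstract can be illustrated with a minimal sketch. The abstract does not specify the scoring protocol, so everything here is an assumption: recognition is scored as recall over annotated tools pooled across queries, end-to-end success requires the predicted tool sequence to match the reference exactly, and the `Query` fields and sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Query:
    """One hypothetical benchmark item; field names are illustrative only."""
    gold_tools: set[str]   # all tools annotated in the scene
    gold_plan: list[str]   # reference tool-use sequence
    pred_tools: set[str]   # tools the model reports seeing
    pred_plan: list[str]   # tool sequence the model proposes


def recognition_recall(queries: list[Query]) -> float:
    """Assumed metric: share of annotated tools the model identified, pooled over queries."""
    found = sum(len(q.gold_tools & q.pred_tools) for q in queries)
    total = sum(len(q.gold_tools) for q in queries)
    return found / total if total else 0.0


def end_to_end_success(queries: list[Query]) -> float:
    """Assumed metric: share of queries whose predicted plan equals the reference exactly."""
    if not queries:
        return 0.0
    solved = sum(q.pred_plan == q.gold_plan for q in queries)
    return solved / len(queries)


# Made-up sample data, not drawn from the benchmark.
sample = [
    Query({"hammer", "nail", "pliers"}, ["hammer"], {"hammer", "nail"}, ["hammer"]),
    Query({"wrench", "screwdriver"}, ["screwdriver"], {"wrench"}, ["wrench"]),
]

if __name__ == "__main__":
    print(f"recognition recall: {recognition_recall(sample):.2f}")
    print(f"end-to-end success: {end_to_end_success(sample):.2f}")
```

Scoring the two stages separately, as in this sketch, is what makes a gap between perception and planning visible: a model can recover most tools in a scene and still fail the query if it maps them onto the task incorrectly.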

Paper Type: Long

Research Area: Resources and Evaluation

Research Area Keywords: benchmarking, vision question answering

Contribution Types: Model analysis & interpretability, Data resources

Languages Studied: English

EMNLP 2026 AI Reviewing Experiment: yes

Submission Number: 13671
