Keywords: Vision-Language Models, Task and Motion Planning, Interpretability
TL;DR: This work introduces ViLaIn-TAMP, a framework that combines vision-language models with symbolic task and motion planning in a closed-loop system that refines robot plans using motion feedback.
Abstract: While recent advances in vision-language models (VLMs) have accelerated the development of language-guided robot planners, their black-box nature often precludes the safety guarantees and interpretability crucial for real-world deployment. Conversely, classical symbolic planners offer rigorous safety verification but require significant expert knowledge to set up. To bridge this gap, this paper proposes ViLaIn-TAMP, a hybrid planning framework that enables verifiable, interpretable, and autonomous robot behaviors. ViLaIn-TAMP comprises three main components: (1) a Vision-Language Interpreter (ViLaIn), adapted from previous work, which converts multimodal inputs into structured problem specifications using off-the-shelf VLMs without additional domain-specific training; (2) a modular Task and Motion Planning (TAMP) system that grounds the specifications into executable trajectory sequences through symbolic and geometric constraint reasoning and leverages learning-based skills for complex manipulation; and (3) a corrective planning module that receives concrete task- or motion-level failure feedback from the TAMP component and feeds adapted logic and geometric feasibility constraints back to ViLaIn to refine the specifications. Evaluated on five challenging manipulation tasks in a cooking domain, ViLaIn-TAMP outperforms direct VLM-as-a-planner approaches by more than 15% in mean success rate while improving interpretability. Finally, ViLaIn-TAMP's closed-loop architecture achieves a mean success rate more than 30% higher than the same system without corrective planning, improving execution robustness.
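The closed-loop generate-plan-refine cycle described in the abstract can be summarized in a short conceptual sketch. This is not the authors' implementation or API: every class, method, and field name below (VisionLanguageInterpreter, TAMPPlanner, PlanResult, and so on) is a hypothetical placeholder, and the component internals are stubbed out; only the feedback loop structure is illustrated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlanResult:
    success: bool
    trajectory: Optional[list]   # motion segments if planning succeeded
    feedback: Optional[str]      # task/motion failure description otherwise


class VisionLanguageInterpreter:
    """Placeholder for ViLaIn: an off-the-shelf VLM that turns an image plus a
    language instruction into a structured problem specification."""

    def generate_spec(self, image, instruction, constraints=None) -> str:
        # Prompt the VLM; `constraints` carries corrective feedback from
        # earlier planning attempts (stubbed here).
        ...


class TAMPPlanner:
    """Placeholder for the TAMP component: symbolic + geometric planning that
    grounds a specification into executable trajectories."""

    def plan(self, spec: str) -> PlanResult:
        ...  # stubbed


def closed_loop_plan(image, instruction, max_rounds: int = 3):
    """Run the hypothetical ViLaIn-TAMP loop for up to `max_rounds` refinements."""
    vilain, tamp = VisionLanguageInterpreter(), TAMPPlanner()
    constraints = None
    for _ in range(max_rounds):
        spec = vilain.generate_spec(image, instruction, constraints)
        result = tamp.plan(spec)
        if result.success:
            return result.trajectory
        # Corrective planning: turn the concrete failure into logic and
        # geometric-feasibility constraints and ask ViLaIn to refine the spec.
        constraints = result.feedback
    return None  # no feasible plan within the refinement budget
```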
Supplementary Zip: zip
Submission Number: 21