Keywords: Vision-Language Models, Zero-Shot Planning, Bimanual Manipulation, Scene Understanding
TL;DR: A zero-shot, language-conditioned framework for bimanual manipulation unifies task reasoning and execution via structured scene understanding, explicit action planning, parallel scheduling, and closed-loop validation, enabling context-aware behaviors.
Abstract: Robotic manipulation requires connecting high-level task objectives with the physical constraints of action execution. This work presents a zero-shot, language-conditioned framework for bimanual manipulation in which task-level reasoning and the way actions are executed are addressed jointly within a unified decision process rather than decoupled. The objective is to move beyond high-level manipulation primitives and enable decisions that account for how an action should be physically realized in context. The proposed framework grounds this process in structured scene understanding, planning over an action space that combines manipulation primitives with context-aware execution choices, and closed-loop validation of action feasibility and outcomes. Preliminary results, obtained using a bimanual quadrupedal system, indicate the potential of this perspective for supporting more diverse manipulation behaviors. These results also suggest a promising direction for future Vision-Language-Action models, where structured action representations of this kind could serve as supervision targets for learning manipulation policies grounded in both task semantics and execution constraints.
Submission Number: 42