Hierarchical Vision-Language-Action Policies for Global Reasoning in Assembly Tasks

Published: 01 Feb 2026, Last Modified: 01 Feb 2026
Venue: CoRL 2025 Workshop LEAP (Rolling)
License: CC BY 4.0
Keywords: Planning, Hierarchical Models, Vision-Language-Models, Imitation Learning
TL;DR: We use a Vision-Language Model to convert multimodal, human-style assembly instructions (text and sketches) into simple, visually grounded conditioning signals for a downstream manipulation policy.
Abstract: The rapid development of robotic control policies holds great potential for automating industrial assembly, especially for small-batch production, where traditional methods are not cost-effective. To achieve the necessary flexibility, robots must interpret the same multimodal assembly instructions used by human workers, which typically combine text with visual elements like technical drawings to convey spatial and semantic information. While existing policies primarily focus on a single instruction modality at a time, we specifically address the challenge of combining information from complementary text and sketch instructions for robot action prediction. To this end, we propose a hierarchical framework where a fine-tuned vision-language model (VLM) acts as a high-level planner, converting multimodal instructions into a sequence of symbolic subtasks in the context of the environment observation. Crucially, each symbolic subtask is spatially grounded by a corresponding mask predicted over the robot's visual observation. The modular design improves interpretability and adaptability, and by offloading instruction understanding to the VLM planner, it reduces the data annotation burden on the manipulation policy. We validate our approach on a novel benchmark designed for quantifying multimodal instruction execution capabilities and conduct careful ablation studies to assess the impact of key design decisions.
Submission Number: 4
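The hierarchical interface described in the abstract, where a VLM planner emits symbolic subtasks each grounded by a mask over the observation, and a low-level policy consumes only those conditioning signals, can be sketched as follows. This is a minimal illustrative stand-in: all names (`Subtask`, `plan`, `execute`) and the dummy planner logic are assumptions for exposition, not the paper's actual API or models.

```python
from dataclasses import dataclass
from typing import List

import numpy as np

# Hypothetical sketch of the hierarchical VLM-planner / manipulation-policy
# interface from the abstract. The planner fuses text and sketch instructions
# with the current observation into symbolic subtasks, each grounded by a
# binary mask; the downstream policy sees only the subtask and its mask.

@dataclass
class Subtask:
    symbol: str        # symbolic action label, e.g. "pick(part_a)" (illustrative)
    mask: np.ndarray   # binary mask over the camera image grounding the target

def plan(text: str, sketch: np.ndarray, observation: np.ndarray) -> List[Subtask]:
    """Stand-in for the fine-tuned VLM planner. A real planner would run
    inference over the multimodal instructions; here we return a dummy plan
    with a single subtask whose mask covers the image centre."""
    h, w = observation.shape[:2]
    mask = np.zeros((h, w), dtype=bool)
    mask[h // 4 : 3 * h // 4, w // 4 : 3 * w // 4] = True
    return [Subtask(symbol="pick(part_a)", mask=mask)]

def execute(subtask: Subtask, observation: np.ndarray) -> None:
    """Stand-in for the low-level manipulation policy, conditioned only on
    the symbolic label and its grounding mask, not the raw instructions."""
    print(f"executing {subtask.symbol}, mask pixels: {int(subtask.mask.sum())}")

obs = np.zeros((64, 64, 3))
for st in plan("insert the peg into the marked hole", np.zeros((64, 64)), obs):
    execute(st, obs)
```

The design point this sketch mirrors is the division of labor: instruction understanding lives entirely in `plan`, so the policy behind `execute` only needs data annotated with simple symbol-plus-mask pairs.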