Keywords: Robotic manipulation, Large Language Models, Imitation Learning, hierarchical policy
TL;DR: We propose a hierarchical framework that integrates VLM-LLM task planning with a multimodal low-level policy augmented by an explicit action mask, enabling accurate, efficient, and generalizable multi-task robotic manipulation across diverse scenarios.
Abstract: Hierarchical policies that integrate high-level planning with low-level control have shown promising performance in robotic manipulation but remain limited.
We present a hierarchical framework that combines a two-stage task planner with a low-level action planner that integrates multimodal inputs and an explicit action-mask policy.
At the high level, a Vision-Language Model (VLM) first perceives object and scene information from observations, and a Large Language Model (LLM) then reasons over this information together with a task library and human instructions to generate a textual task plan.
This two-stage design mitigates modality bias.
At the low level, we employ an asymmetric encoder, using SigLIP2 with Weight-Decomposed Low-Rank Adaptation (DoRA) for text and ResNets for multi-view vision.
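As a rough illustration of the DoRA reparameterization mentioned above, the sketch below splits a frozen weight into a trainable per-column magnitude and a direction that receives a low-rank update. The shapes, the rank, and the names `dora_weight`, `w0`, `a`, `b`, and `m` are illustrative assumptions; how the adapter is attached to SigLIP2 is not specified in the abstract.

```python
import numpy as np

def dora_weight(w0, a, b, m):
    """Return m * V / ||V||_c, where V = w0 + b @ a and ||.||_c is the per-column L2 norm."""
    v = w0 + b @ a                                        # direction with low-rank update
    col_norm = np.linalg.norm(v, axis=0, keepdims=True)   # shape (1, d_in)
    return m * v / col_norm

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
d_out, d_in, rank = 6, 4, 2
w0 = rng.normal(size=(d_out, d_in))             # frozen pretrained weight
a = rng.normal(size=(rank, d_in))               # trainable low-rank factor
b = np.zeros((d_out, rank))                     # trainable, zero-initialized
m = np.linalg.norm(w0, axis=0, keepdims=True)   # trainable magnitude, initialized to ||w0||_c

w_init = dora_weight(w0, a, b, m)               # equals w0 before any training

# After a (random, stand-in) update to b, column norms are still set by m.
b_trained = rng.normal(size=(d_out, rank))
w_adapted = dora_weight(w0, a, b_trained, m)
```

With `b` initialized to zero the adapted weight matches the pretrained one, so fine-tuning starts from the original model; afterwards the magnitude `m` and the direction are learned separately.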
We introduce a shared Temperature-Scaled Spatial Attention module to enhance multi-view features and a Bidirectional Cross-Attention module to fuse language and vision features for an Action Chunking Transformer (ACT) policy.
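The abstract does not define the Temperature-Scaled Spatial Attention module, so the following is only one plausible reading: a softmax over spatial locations whose logits are divided by a temperature, with the same scoring vector reused for every camera view. The function name `spatial_attention`, the linear scoring vector `w`, and the feature-map sizes are all assumptions made for this sketch.

```python
import numpy as np

def spatial_attention(feat, w, temperature=1.0):
    """Reweight a (C, H, W) feature map with a softmax over its H*W locations.

    `w` is a (C,) scoring vector assumed to be shared across camera views.
    """
    _, h, wd = feat.shape
    scores = np.tensordot(w, feat, axes=([0], [0])).reshape(-1) / temperature
    scores = scores - scores.max()           # subtract max for numerical stability
    attn = np.exp(scores)
    attn = (attn / attn.sum()).reshape(h, wd)
    return feat * attn[None, :, :], attn

rng = np.random.default_rng(0)
views = [rng.normal(size=(8, 5, 5)) for _ in range(3)]   # three toy camera views
w = rng.normal(size=8)                                    # shared across all views

enhanced = [spatial_attention(v, w, temperature=0.5)[0] for v in views]

# A lower temperature concentrates attention on fewer locations.
_, attn_sharp = spatial_attention(views[0], w, temperature=0.1)
_, attn_soft = spatial_attention(views[0], w, temperature=10.0)
```

In this reading the temperature controls how peaked the spatial weighting is: small values focus on a few locations, large values approach uniform weighting.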
For multi-task switching, we propose a novel explicit action-mask policy that jointly predicts actions and their validity masks.
The policy learns not only fine-grained control but also when to stop, enabling real-time sub-task completion detection and robust switching across long-horizon tasks without additional inference overhead.
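To make the stopping behavior described above concrete, here is a minimal sketch of how a predicted validity mask could gate execution of an action chunk. It assumes the policy returns one mask probability per action step and that the first value below a threshold marks sub-task completion; the name `valid_prefix`, the 0.5 threshold, and the toy numbers are hypothetical and not taken from the paper.

```python
from typing import List, Sequence, Tuple

def valid_prefix(
    actions: Sequence[Sequence[float]],
    mask_probs: Sequence[float],
    threshold: float = 0.5,
) -> Tuple[List[Sequence[float]], bool]:
    """Return the actions to execute and whether the sub-task is judged finished.

    Actions are kept up to (not including) the first step whose predicted
    validity falls below `threshold`.
    """
    for i, p in enumerate(mask_probs):
        if p < threshold:
            return list(actions[:i]), True
    return list(actions), False

# Toy chunk of four 1-D actions with their predicted validity.
chunk = [[0.1], [0.2], [0.3], [0.0]]
mask = [0.9, 0.8, 0.2, 0.1]

to_execute, done = valid_prefix(chunk, mask)
# to_execute == [[0.1], [0.2]] and done is True, so the controller would
# move on to the next sub-task in the plan.
```

Because the mask comes out of the same forward pass as the actions, a check of this kind needs no extra model call, which is consistent with the claim of no additional inference overhead.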
Experiments on weighing and multi-object manipulation scenarios demonstrate planning accuracy, execution success, and efficiency, with ablations confirming the contribution of each component.
Finally, deployment on a different robotic platform in a new scenario validates generalization.
The video and code are available at https://hierarchical-llm-robotics.github.io.
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 12430