Hierarchical LLM-Guided Multi-Task Manipulation with Multimodal Learning and Action-Mask Policy

18 Sept 2025 (modified: 13 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Robotic Manipulation, Large Language Models, Imitation Learning, Hierarchical Policy
TL;DR: We propose a hierarchical framework that integrates VLM-LLM task planning with a multimodal low-level policy augmented by an explicit action mask, enabling accurate, efficient, and generalizable multi-task robotic manipulation across diverse scenarios.
Abstract: Hierarchical policies that integrate high-level planning with low-level control have shown strong performance in robotic manipulation but remain limited in multi-task settings. We present a hierarchical framework that combines a two-stage task planner with a low-level action planner that integrates multimodal inputs and an explicit action-mask policy. At the high level, a Vision-Language Model (VLM) first perceives object and scene information from observations, and a Large Language Model (LLM) then reasons over this information together with a task library and human instructions to generate a textual task plan. This two-stage design decouples perception from reasoning and mitigates modality bias. At the low level, we employ an asymmetric encoder: SigLIP2 fine-tuned with Weight-Decomposed Low-Rank Adaptation (DoRA) for text, and ResNets for multi-view vision. We introduce a shared Temperature-Scaled Spatial Attention module to enhance multi-view features and a Bidirectional Cross-Attention module to fuse language-vision features for an Action Chunking Transformer (ACT) policy. For multi-task switching, we propose a novel explicit action-mask policy that jointly predicts actions and their validity masks. The policy learns not only fine-grained control but also when to stop, enabling real-time sub-task completion detection and robust switching across long-horizon tasks without additional inference overhead. Experiments on weighing and multi-object manipulation scenarios demonstrate high planning accuracy, execution success, and efficiency, with ablations confirming the contribution of each component. Finally, deployment on a different robotic platform in a new scenario validates the framework's generalization. The video and code are available at https://hierarchical-llm-robotics.github.io.
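The abstract describes the two-stage planner only at a high level. The following is a minimal Python sketch of that flow, assuming hypothetical `query_vlm` and `query_llm` client stubs (not any specific API from the paper) and an illustrative prompt format.

```python
from typing import List

def query_vlm(images: List[str], prompt: str) -> str:
    """Placeholder VLM client; swap in an actual vision-language model."""
    raise NotImplementedError

def query_llm(prompt: str) -> str:
    """Placeholder LLM client; swap in an actual language model."""
    raise NotImplementedError

def plan_task(images: List[str], instruction: str, task_library: List[str]) -> List[str]:
    # Stage 1: the VLM perceives object and scene information from observations.
    scene_description = query_vlm(
        images=images,
        prompt="List the objects in the scene and their spatial relations.",
    )
    # Stage 2: the LLM reasons over text only -- scene description, task
    # library, and the human instruction -- to produce an ordered task plan.
    plan_text = query_llm(
        "Scene: " + scene_description + "\n"
        "Available sub-tasks: " + ", ".join(task_library) + "\n"
        "Instruction: " + instruction + "\n"
        "Output one sub-task per line, in execution order."
    )
    return [line.strip() for line in plan_text.splitlines() if line.strip()]
```

Because the second stage sees only text, the LLM's plan cannot be skewed toward one input modality, which is the modality-bias mitigation the abstract refers to.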
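The explicit action-mask policy is the paper's key low-level addition: the policy jointly predicts an action chunk and per-step validity masks. Below is a minimal PyTorch sketch of one plausible formulation, assuming an ACT-style chunk decoder; the names `ActionMaskHead` and `masked_policy_loss` and all dimensions are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionMaskHead(nn.Module):
    """Joint action / validity-mask head on top of an ACT-style decoder.

    Hypothetical sizes: d_model is the decoder width, action_dim the robot
    action dimension (e.g., 7 for a single arm with gripper).
    """

    def __init__(self, d_model: int = 512, action_dim: int = 7):
        super().__init__()
        self.action_proj = nn.Linear(d_model, action_dim)  # per-step action
        self.mask_proj = nn.Linear(d_model, 1)             # per-step validity logit

    def forward(self, decoder_feats: torch.Tensor):
        # decoder_feats: (batch, chunk_len, d_model) from the chunk decoder.
        actions = self.action_proj(decoder_feats)                # (B, T, action_dim)
        mask_logits = self.mask_proj(decoder_feats).squeeze(-1)  # (B, T)
        return actions, mask_logits

def masked_policy_loss(actions, mask_logits, target_actions, target_mask):
    """L1 action loss weighted by ground-truth validity, plus BCE on the
    validity mask (target_mask is float: 1.0 while the sub-task is running,
    0.0 after it completes)."""
    per_step_err = (actions - target_actions).abs().mean(dim=-1)  # (B, T)
    action_loss = (per_step_err * target_mask).sum() / target_mask.sum().clamp(min=1)
    mask_loss = F.binary_cross_entropy_with_logits(mask_logits, target_mask)
    return action_loss + mask_loss
```

In a formulation like this, the mask head adds only a single linear projection at inference; once the predicted validity drops below a threshold, the controller can declare the current sub-task complete and switch to the next plan step, consistent with the abstract's claim of real-time completion detection without additional inference overhead.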
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 12430