Preference-Based Long-Horizon Robotic Stacking with Multimodal Large Language Models

Published: 01 Feb 2026, Last Modified: 01 Feb 2026, Venue: CoRL 2025 Workshop LEAP (Rolling), License: CC BY 4.0
Keywords: Long-Horizon Manipulation, Robotic Stacking, Multimodal Reasoning, Large Language Models
TL;DR: A multimodal large language model serves as a high-level planner for long-horizon robotic stacking tasks.
Abstract: Pretrained large language models (LLMs) can work as high-level robotic planners by reasoning over abstract task descriptions and natural language instructions. However, they lack the knowledge needed to plan effectively for long-horizon robotic manipulation tasks in which the physical properties of the objects are essential. An example is stacking containers with hidden objects inside, which involves reasoning over hidden physical properties such as weight and stability. To this end, this paper proposes to use multimodal LLMs as high-level planners for such long-horizon robotic stacking tasks. The LLM takes multimodal inputs for each object to be stacked and infers the current best stacking sequence by reasoning over stacking preferences. Given explicit instructions to treat weight and stability jointly as the stacking preference, a Kawada NEXTAGE humanoid robot, guided on-the-fly by an LLM, successfully stacked three boxes containing various hidden objects in the real world. Furthermore, to enable the LLM to reason over multiple preferences at the same time without explicit instructions, we propose to create a custom dataset for fine-tuning the LLM. We simulate all possible stacks of boxes with various contents in a physics simulator and generate training samples for stacking preferences including weight, stability, size, and foothold.
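As a rough illustration of enumerating all possible stacks and labeling them by preference, the following minimal Python sketch ranks every bottom-to-top ordering of a few boxes. It is not the authors' pipeline: the abstract describes a physics simulation, whereas this sketch swaps in a toy pairwise heuristic. The `Box` fields, `preference_score`, `enumerate_stacks`, and the example values are all hypothetical names and numbers introduced here for illustration.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Box:
    name: str
    weight_kg: float     # assumption: total weight including hidden contents
    footprint_cm: float  # assumption: side length of a square base


def preference_score(stack):
    """Score a bottom-to-top stack with a toy heuristic (not a physics simulation)."""
    score = 0
    for lower, upper in zip(stack, stack[1:]):
        if lower.weight_kg >= upper.weight_kg:
            score += 1  # weight preference: heavier box sits below
        if lower.footprint_cm >= upper.footprint_cm:
            score += 1  # size/foothold proxy: upper box fits on the lower one
    return score


def enumerate_stacks(boxes):
    """Return (score, [names bottom-to-top]) for every ordering, best first."""
    scored = [
        (preference_score(order), [box.name for box in order])
        for order in permutations(boxes)
    ]
    # sorted() is stable, so tied stacks keep their enumeration order.
    return sorted(scored, key=lambda item: -item[0])


if __name__ == "__main__":
    boxes = [Box("A", 3.0, 30.0), Box("B", 1.5, 25.0), Box("C", 0.5, 20.0)]
    for score, order in enumerate_stacks(boxes):
        print(score, order)
```

Each ranked ordering could, under these assumptions, be turned into a preference-labeled training sample; a real dataset would replace the heuristic with outcomes measured in simulation.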
Submission Number: 15