Keywords: long-horizon planning, manipulation planning, vision-language-action, multimodal reasoning
Abstract: Long-horizon robotic manipulation in scientific experiments involves strict procedural dependencies, multi-stage reasoning, and domain-aware manipulation skills, all of which remain challenging for current multimodal planning systems. Vision-Language-Action (VLA) models excel at multimodal understanding but often lack explicit symbolic knowledge, limiting their compositional and interpretable planning ability. We present the Compositional Multimodal Planner (CoMP), a hierarchical reasoning framework that decouples task understanding, perceptual reasoning, and skill execution for complex experimental procedures. CoMP consists of: (1) a task-level interpreter that uses chain-of-thought prompting to infer task logic, (2) a mid-level multimodal planner that integrates future scene prediction to enable visually grounded reasoning, and (3) a low-level skill controller that executes actions via reinforcement learning. This decoupled design allows each component to be optimized independently, improving controllability, extensibility, and generalization without fine-tuning large models. To facilitate evaluation, we introduce a benchmark dataset of scientific experiment tasks. Experiments on both our benchmark and RLBench show that CoMP achieves strong performance and superior compositional generalization compared to competitive baselines, highlighting the advantages of structured, decoupled multimodal planning for long-horizon scientific workflows.
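The abstract's three-level decomposition can be illustrated with a minimal sketch. Everything below is hypothetical: the class and function names (TaskInterpreter, MidLevelPlanner, SkillController, run_comp) and the stubbed behaviors are our own illustrative placeholders, not CoMP's actual code or API; a real system would back each level with an LLM, a learned scene-prediction model, and RL-trained policies respectively.

```python
from dataclasses import dataclass
from typing import List

# Illustrative sketch of the decoupled hierarchy described in the abstract.
# All names and behaviors are assumptions, not the authors' implementation.

@dataclass
class Subgoal:
    description: str  # symbolic subgoal inferred at the task level

class TaskInterpreter:
    """Task level: chain-of-thought prompting over the instruction (stubbed)."""
    def interpret(self, instruction: str) -> List[Subgoal]:
        # A real interpreter would use an LLM to infer procedurally ordered
        # task logic; here we naively split on "then" as a placeholder.
        return [Subgoal(step.strip()) for step in instruction.split(" then ")]

class MidLevelPlanner:
    """Mid level: grounds each subgoal in the current and predicted scene."""
    def predict_future_scene(self, observation: dict, subgoal: Subgoal) -> dict:
        return observation  # placeholder for a learned scene-prediction model

    def plan(self, subgoal: Subgoal, observation: dict) -> List[str]:
        predicted = self.predict_future_scene(observation, subgoal)
        # A real planner would condition skill selection on `predicted`;
        # this stub emits one nominal skill per subgoal.
        return [f"skill for: {subgoal.description}"]

class SkillController:
    """Low level: executes each skill with an RL-trained policy (stubbed)."""
    def execute(self, skill: str) -> bool:
        print(f"executing: {skill}")
        return True

def run_comp(instruction: str, observation: dict) -> bool:
    interpreter, planner, controller = TaskInterpreter(), MidLevelPlanner(), SkillController()
    for subgoal in interpreter.interpret(instruction):      # (1) task logic
        for skill in planner.plan(subgoal, observation):    # (2) grounded plan
            if not controller.execute(skill):               # (3) RL execution
                return False  # strict procedural dependency: abort on failure
    return True

run_comp("heat the beaker then measure the pH", observation={})
```

The sketch's main point is the interface boundary: because each level communicates only through subgoals and skill names, any one component can be swapped or retrained independently, which is the controllability and extensibility claim made in the abstract.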
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 6012