Keywords: long-horizon planning, manipulation planning, vision-language-action, multimodal reasoning
Abstract: Long-horizon robotic manipulation in scientific experiments involves strict procedural dependencies, multi-stage reasoning, and domain-aware manipulation skills, all of which remain challenging for current multimodal planning systems. Vision-Language-Action (VLA) models excel at multimodal understanding but often lack explicit symbolic knowledge, limiting their compositional and interpretable planning ability. We present the Compositional Multimodal Planner (CoMP), a hierarchical reasoning framework that decouples task understanding, perceptual reasoning, and skill execution for complex experimental procedures. CoMP consists of: (1) a task-level interpreter that uses chain-of-thought prompting to infer task logic, (2) a mid-level multimodal planner that integrates future scene prediction to enable visually grounded reasoning, and (3) a low-level skill controller that executes actions via reinforcement learning. This decoupled design allows each component to be optimized independently, improving controllability, extensibility, and generalization without fine-tuning large models. To facilitate evaluation, we introduce a benchmark dataset of scientific experiment tasks. Experiments on both our benchmark and RLBench show that CoMP achieves strong performance and superior compositional generalization compared to competitive baselines, highlighting the advantages of structured, decoupled multimodal planning for long-horizon scientific workflows.
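The abstract's three-level decomposition can be illustrated with a minimal sketch. Everything below is hypothetical: the class and function names (TaskInterpreter, MidLevelPlanner, SkillController, run_comp) and the stubbed behaviors are our own illustrative placeholders, not CoMP's actual code or API; a real system would back each level with an LLM, a learned scene-prediction model, and RL-trained policies respectively.

```python
from dataclasses import dataclass
from typing import List

# Illustrative sketch of the decoupled hierarchy described in the abstract.
# All names and behaviors are assumptions, not the authors' implementation.

@dataclass
class Subgoal:
    description: str  # symbolic subgoal inferred at the task level

class TaskInterpreter:
    """Task level: chain-of-thought prompting over the instruction (stubbed)."""
    def interpret(self, instruction: str) -> List[Subgoal]:
        # A real interpreter would use an LLM to infer procedurally ordered
        # task logic; here we naively split on "then" as a placeholder.
        return [Subgoal(step.strip()) for step in instruction.split(" then ")]

class MidLevelPlanner:
    """Mid level: grounds each subgoal in the current and predicted scene."""
    def predict_future_scene(self, observation: dict, subgoal: Subgoal) -> dict:
        return observation  # placeholder for a learned scene-prediction model

    def plan(self, subgoal: Subgoal, observation: dict) -> List[str]:
        predicted = self.predict_future_scene(observation, subgoal)
        # A real planner would condition skill selection on `predicted`;
        # this stub emits one nominal skill per subgoal.
        return [f"skill for: {subgoal.description}"]

class SkillController:
    """Low level: executes each skill with an RL-trained policy (stubbed)."""
    def execute(self, skill: str) -> bool:
        print(f"executing: {skill}")
        return True

def run_comp(instruction: str, observation: dict) -> bool:
    interpreter, planner, controller = TaskInterpreter(), MidLevelPlanner(), SkillController()
    for subgoal in interpreter.interpret(instruction):      # (1) task logic
        for skill in planner.plan(subgoal, observation):    # (2) grounded plan
            if not controller.execute(skill):               # (3) RL execution
                return False  # strict procedural dependency: abort on failure
    return True

run_comp("heat the beaker then measure the pH", observation={})
```

The sketch's main point is the interface boundary: because each level communicates only through subgoals and skill names, any one component can be swapped or retrained independently, which is the controllability and extensibility claim made in the abstract.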
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 6012