Keywords: Compositional Generalization, Robot Manipulation, Robot Planning
Abstract: The rapid growth of robotics has been driven by advances in both hardware and algorithms, yet a fundamental gap remains between real-world decision making and virtual simulations. Traditional designs, such as single grippers or human-like dual arms, often fail to fully exploit algorithmic capabilities or handle tasks constrained by embodiment, such as lifting thin cards or manipulating heavy and bulky objects. To address this hardware–software mismatch, we introduce RoboMonster, a new paradigm that integrates heterogeneous end-effectors with a cross-end-effector embodied planning brain. RoboMonster reasons over visual inputs, task instructions, and the properties of its diverse end-effectors to select and coordinate optimal agents, decomposing complex problems into executable sub-tasks. We design four specialized end-effectors, train corresponding policies, and develop a high-level planner based on combinatorial logical, spatial, and temporal constraints to ensure safe and efficient multi-arm collaboration. Experiments across challenging tasks demonstrate that RoboMonster significantly outperforms systems relying on a single gripper, highlighting the advantages of combining heterogeneous end-effectors with structured planning for embodied intelligence.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7876