Keywords: Perception for Grasping and Manipulation, Bimanual Manipulation, AI-based Methods
TL;DR: We present a dual-arm framework using vision-language models to select grasps and arm roles from RGB-D proposals, outperforming geometry-only baselines on nine real-world bimanual tasks without task-specific training.
Abstract: Bimanual manipulation requires joint reasoning over object affordances and arm allocation, which is a challenge for geometry-only planners. To address this, we propose a hierarchical framework that leverages Vision-Language Models (VLMs) for task-aware bimanual affordance prediction without category-specific training. Our approach fuses multi-view RGB-D data to generate global 6-DoF grasp proposals, which the VLM filters to determine task-relevant contact regions and optimal arm assignments. Evaluated on a dual-arm robot across nine real-world tasks, including tool use and human handovers, our approach significantly outperforms existing baselines, demonstrating that VLM-guided semantic reasoning enables highly reliable bimanual manipulation in unstructured environments.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 37