Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models
Keywords: Robotic Manipulation; Vision-Language Models; Assembly
TL;DR: We propose a novel framework that enables robots to perform complex assembly tasks guided by high-level manual instructions.
Abstract: Humans excel at interpreting abstract instruction manuals to perform complex manipulation tasks, a capability that remains challenging for robots.
We present Manual2Skill, a framework that enables robots to execute complex assembly tasks using high-level manual instructions. Our approach leverages a Vision-Language Model (VLM) to extract structured information from instructional images and constructs hierarchical assembly graphs capturing furniture parts and their relationships.
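The paper does not give the concrete data structure behind the hierarchical assembly graph, but the idea of a tree whose leaves are atomic parts and whose internal nodes are subassemblies built in a manual step can be sketched as follows. All names (`AssemblyNode`, `assembly_order`) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a hierarchical assembly graph (illustrative, not the
# paper's actual data structure). Leaves are atomic furniture parts; internal
# nodes are subassemblies produced by one manual step.
from dataclasses import dataclass, field
from typing import List


@dataclass
class AssemblyNode:
    name: str                                   # e.g. "leg", "seat", "step_2_subassembly"
    children: List["AssemblyNode"] = field(default_factory=list)

    def is_part(self) -> bool:
        # A leaf node corresponds to a single physical part.
        return not self.children


def assembly_order(root: AssemblyNode) -> List[str]:
    """Post-order traversal: children must be assembled before their parent."""
    order: List[str] = []
    for child in root.children:
        order.extend(assembly_order(child))
    order.append(root.name)
    return order


# Toy example: a stool whose legs form a subassembly that attaches to the seat.
stool = AssemblyNode("stool", [
    AssemblyNode("seat"),
    AssemblyNode("leg_frame", [AssemblyNode("leg_1"), AssemblyNode("leg_2")]),
])
print(assembly_order(stool))  # ['seat', 'leg_1', 'leg_2', 'leg_frame', 'stool']
```

A post-order traversal of such a graph yields a valid execution order, since every subassembly is completed before it is used in a later step.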
For execution, a pose estimation model predicts 6D poses of parts, and a motion planner generates executable actions.
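As a small illustration of how a predicted 6D pose (rotation plus translation) places a part in the assembly frame, the sketch below packs the pose into a homogeneous transform and applies it to part points. This is a generic geometric sketch with made-up values, not the paper's pose estimation model or motion planner.

```python
# Minimal sketch: apply a predicted 6D pose (rotation + translation) to part
# geometry. numpy only; the pose values below are arbitrary examples.
import numpy as np


def pose_to_matrix(rotation: np.ndarray, translation: np.ndarray) -> np.ndarray:
    """Pack a 3x3 rotation and a 3-vector translation into a 4x4 transform."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T


def transform_points(T: np.ndarray, points: np.ndarray) -> np.ndarray:
    """Apply a 4x4 homogeneous transform to an (N, 3) array of part points."""
    homogeneous = np.hstack([points, np.ones((points.shape[0], 1))])
    return (T @ homogeneous.T).T[:, :3]


# Example: rotate a part 90 degrees about z and shift it 0.1 m along x.
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]])
T = pose_to_matrix(Rz, np.array([0.1, 0.0, 0.0]))
corners = np.array([[0.0, 0.0, 0.0], [0.05, 0.0, 0.0]])
print(transform_points(T, corners))  # placed part points in the assembly frame
```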
We demonstrate the effectiveness of Manual2Skill by successfully assembling several real-world IKEA furniture items, highlighting its ability to manage long-horizon manipulation tasks with both efficiency and precision. This work advances robot learning from manuals and brings robots closer to human-level understanding and execution. Project Page: https://owensun2004.github.io/Furniture-Assembly-Web/
Submission Number: 6