Keywords: Vision Language Models, Robotic Manipulation, Task and Motion Planning
Abstract: We present a modular, robot-agnostic manipulation framework that executes long-horizon tasks
specified through natural-language instructions. The system integrates a Vision-Language Supervisory Planner (VLM-SP), a
Grasp-Pose Estimator (GPE), and a structured skill repository
containing the robot's executable skills. Given a natural-language
instruction, the VLM planner decomposes the task into grounded
subtasks, an essential step for reducing planning complexity,
enabling skill reuse, and ensuring robust execution. Each subtask
is then mapped to appropriate skills and paired with object-specific
grasp predictions for reliable manipulation.
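The described pipeline (planner decomposes, skills are looked up, grasps are paired) can be summarized in a brief sketch. All class, method, and field names below are hypothetical placeholders for illustration; the abstract does not specify an API, and the module internals are left as stubs.

```python
# Hypothetical sketch of the three-module pipeline described in the abstract.
# Names (VLMSupervisoryPlanner, GraspPoseEstimator, SkillRepository, Subtask)
# are assumptions for illustration, not the paper's actual interfaces.
from dataclasses import dataclass


@dataclass
class Subtask:
    description: str    # grounded subtask, e.g. "pick up the red mug"
    target_object: str  # object the subtask manipulates


class VLMSupervisoryPlanner:
    def plan(self, instruction: str) -> list[Subtask]:
        # The VLM would decompose the instruction into grounded subtasks here.
        raise NotImplementedError


class GraspPoseEstimator:
    def predict(self, target_object: str):
        # An object-specific grasp pose would be predicted here.
        raise NotImplementedError


class SkillRepository:
    def lookup(self, subtask: Subtask):
        # The subtask is mapped to an executable robot skill here.
        raise NotImplementedError


def execute(instruction: str,
            planner: VLMSupervisoryPlanner,
            grasps: GraspPoseEstimator,
            skills: SkillRepository) -> None:
    # Decompose the long-horizon task, then pair each subtask's skill
    # with its object-specific grasp prediction before execution.
    for subtask in planner.plan(instruction):
        skill = skills.lookup(subtask)
        grasp = grasps.predict(subtask.target_object)
        skill(grasp)  # the robot executes the skill using the predicted grasp
```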
Submission Number: 9