Keywords: Embodied Agentic AI, Vision Language Models, Hierarchical Reinforcement Learning
TL;DR: MARS: A dual-agent framework for few-shot manipulation skill acquisition and reuse
Abstract: Recent advances in Vision Language Models (VLMs) have enabled agentic robotic frameworks that interpret natural language instructions and generate robot commands in either the task-level or the joint-level space. However, existing approaches predominantly rely on either zero-shot execution without modular skill reuse or task-specific policy learning that must be retrained for each new behavior. In this paper, we propose an embodied dual-agent architecture that allows agents to acquire new skills by exploring the input and hyperparameter spaces of model-based controllers, and to reuse successfully learned skills in future tasks through a skill memory module. The exploratory behavior is enabled through the interaction between VLM planning and monitoring agents: the former determines a long-term plan and suggests the next skill, while the latter explores skill changes that can lead to task success.
We call this framework Model-based Acquisition and Retrieval of Skills (MARS). We showcase MARS across different embodiments, in simulation and in the real world, through a variety of prehensile and non-prehensile, short- and long-horizon tasks. Ablation studies on multiple embodiments in the real world and in simulation show that the monitoring agent improves success rates from $25\%$ to $77\%$ while decreasing the number of trials until task success from $6.25$ to $1.5$, and that the skill memory module improves success rates from $39\%$ to $70.5\%$. The skill memory also allows transferring skills to new objects by reusing strategies learned from previous interactions.
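As a rough illustration of the acquire-then-reuse loop described in the abstract, the sketch below shows one way the pieces could fit together. It is not the MARS implementation: the names `SkillMemory`, `acquire_or_reuse`, `run_plan`, `toy_propose`, and `toy_execute` are hypothetical, the plan is a fixed list standing in for the planning agent, the "monitoring agent" is a plain function rather than a VLM, and the controller is a toy with a single `force` parameter.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Params = Dict[str, float]


@dataclass
class SkillMemory:
    """Hypothetical skill memory: keeps the parameters that last succeeded per skill."""
    store: Dict[str, Params] = field(default_factory=dict)

    def retrieve(self, skill: str) -> Optional[Params]:
        return self.store.get(skill)

    def save(self, skill: str, params: Params) -> None:
        self.store[skill] = dict(params)


def acquire_or_reuse(
    skill: str,
    memory: SkillMemory,
    propose: Callable[[str, Optional[Params]], Params],
    execute: Callable[[str, Params], bool],
    max_trials: int = 5,
) -> Optional[int]:
    """Return the number of trials needed for success, or None if all trials fail."""
    # Reuse stored parameters when available; otherwise ask for an initial guess.
    params = memory.retrieve(skill)
    if params is None:
        params = propose(skill, None)
    for trial in range(1, max_trials + 1):
        if execute(skill, params):
            memory.save(skill, params)  # remember what worked for future tasks
            return trial
        # Stand-in for the monitoring agent: suggest changed parameters after a failure.
        params = propose(skill, params)
    return None


def run_plan(
    plan: List[str],
    memory: SkillMemory,
    propose: Callable[[str, Optional[Params]], Params],
    execute: Callable[[str, Params], bool],
) -> List[Tuple[str, Optional[int]]]:
    """Run each skill in a fixed plan (stand-in for the planning agent's output)."""
    return [(s, acquire_or_reuse(s, memory, propose, execute)) for s in plan]


# Toy stand-ins (assumptions for illustration only).
def toy_propose(skill: str, last: Optional[Params]) -> Params:
    # Start at force 1.0 and increase by 1.0 after each failure.
    return {"force": 1.0} if last is None else {"force": last["force"] + 1.0}


def toy_execute(skill: str, params: Params) -> bool:
    # The toy controller succeeds once the commanded force reaches 3.0.
    return params["force"] >= 3.0
```

With these toy stand-ins, calling `run_plan(["push", "push"], SkillMemory(), toy_propose, toy_execute)` needs three trials for the first `push` and one for the second, since the second call starts from the stored parameters. This mirrors, in miniature, the reduction in trials that the abstract attributes to exploration plus skill reuse.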
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 20