Keywords: embodied agent, reasoning, high-level planning
Abstract: Vision-Language Models (VLMs) have become a powerful foundation for embodied agents, which are typically fine-tuned on expert demonstrations of successful task completions. However, collecting expert demonstrations is prohibitively expensive, and training exclusively on these ideal trajectories leaves agents brittle, unable to recover from inevitable errors. To address this issue, we introduce INFUSER (INjecting synthetic FailUre for Self-correcting Embodied agents). Our idea is to augment existing expert trajectories with automatically generated failure-and-recovery scenarios (i.e., at no human cost),
rather than collecting additional (costly) expert demonstrations. Specifically, we synthesize these data by injecting suboptimal actions into ground-truth paths, creating a diverse set of controlled failure scenarios. By fine-tuning on this augmented dataset, INFUSER learns to take corrective actions and recover from mistakes. We validate the effectiveness of INFUSER through comprehensive evaluations on benchmarks for embodied agents, including EB-ALFRED and EB-Habitat;
training the Qwen2.5-VL-7B model on our synthetic failure-tolerant augmented data improves its performance from 18.3\% to 47.0\% on EB-ALFRED and from 59.7\% to 66.3\% on EB-Habitat, achieving state-of-the-art performance among open-source models and even surpassing Qwen2.5-VL-72B with 10× fewer parameters. These results demonstrate that learning to recover from failures through synthetic augmentation, rather than collecting additional expert demonstrations, is a cost-effective approach to building robust embodied agents.
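The failure-injection idea described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `inject_failure`, the pool of suboptimal actions, and the single-step `undo_last()` recovery are all assumptions made for the example (real recovery may require multiple corrective actions).

```python
import random

def inject_failure(expert_actions, suboptimal_pool, undo_action, rng=None):
    """Augment one expert trajectory with a synthetic failure and recovery.

    expert_actions: list of actions from a successful demonstration.
    suboptimal_pool: candidate wrong actions to inject (assumed given).
    undo_action: a corrective action reverting the injected mistake
                 (a simplifying assumption for this sketch).
    """
    rng = rng or random.Random(0)
    t = rng.randrange(len(expert_actions))   # step at which the failure occurs
    mistake = rng.choice(suboptimal_pool)    # injected suboptimal action
    # Augmented trajectory: expert prefix, mistake, recovery, expert suffix.
    return expert_actions[:t] + [mistake, undo_action] + expert_actions[t:]

# Usage: turn one expert trajectory into a failure-and-recovery trajectory.
expert = ["goto(fridge)", "open(fridge)", "pickup(apple)", "close(fridge)"]
augmented = inject_failure(expert, ["goto(sink)", "pickup(knife)"], "undo_last()")
print(augmented)
```

Fine-tuning on such augmented trajectories exposes the agent to controlled mistakes paired with the corrective behavior that follows them, which is the signal the expert-only data lacks.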
Primary Area: applications to robotics, autonomy, planning
Submission Number: 16447