Keywords: Vision-Language-Action (VLA) Models, Imitation Learning, Multimodal Instruction Following
TL;DR: We propose a framework for context-aware robot navigation that excels in both simulated and real-world environments.
Abstract: Real-life robot navigation involves more than simply reaching a destination; it requires optimizing movements while considering scenario-specific goals. Humans often express these goals through abstract cues, such as verbal commands or rough sketches. While this guidance may be vague or noisy, we still expect robots to navigate as intended. For robots to interpret and execute these abstract instructions in line with human expectations, they need to share a basic understanding of navigation concepts with humans. To address this challenge, we introduce CANVAS, a novel framework that integrates both visual and linguistic instructions for commonsense-aware navigation. CANVAS leverages imitation learning, enabling robots to learn from human navigation behavior. We also present COMMAND, a comprehensive dataset that includes human-annotated navigation results spanning over 48 hours and 219 kilometers, specifically designed to train commonsense-aware navigation systems in simulated environments. Our experiments demonstrate that CANVAS outperforms the strong rule-based ROS NavStack system across all environments, excelling even with noisy instructions. In particular, in the orchard environment where ROS NavStack achieved a 0% success rate, CANVAS reached a 67% success rate. CANVAS also closely aligns with human demonstrations and commonsense constraints, even in unseen environments. Moreover, real-world deployment of CANVAS shows impressive Sim2Real transfer, with a total success rate of 69%, highlighting the potential of learning from human demonstrations in simulated environments for real-world applications.
Submission Number: 87