DRAGON: A Dialogue-Based Robot for Assistive Navigation with Visual Language Grounding

Published: 07 May 2025, Last Modified: 07 May 2025 · ICRA Workshop on Human-Centered Robot Learning · CC BY 4.0
Workshop Statement: This paper contributes to the Human-Centered Robot Learning (HCRL) @ ICRA 2025 workshop by addressing the challenge of deploying foundation models for real-world assistive navigation. DRAGON, our dialogue-based assistive robot, integrates visual-language models (VLMs) to interpret natural language commands and ground them in the surrounding environment, providing real-time navigation and environmental awareness. This approach aligns with the workshop’s theme of leveraging large models and big data to improve human-robot interaction, AI alignment, and safety in human-centered settings. The work also contributes to key workshop discussions on trustworthy AI and data accessibility in human-robot interaction. Unlike virtual AI assistants, embodied AI must navigate physical and social constraints, ensuring both safety and intuitive interaction. DRAGON’s use of open-vocabulary landmark recognition and dialogue-driven disambiguation provides a real-world test case for aligning foundation models with human expectations in assistive settings. These insights are critical for advancing human-centered robot learning and for deploying large-scale models responsibly in embodied AI systems.
Keywords: Human-Centered Robotics, Natural Dialog for HRI, AI-Enabled Robotics
TL;DR: DRAGON is a dialogue-based assistive robot that uses visual-language models to provide real-time wayfinding and environmental awareness for persons with visual impairments, showcasing how foundation models improve human-robot interaction.
Abstract: Persons with visual impairments (PwVI) have difficulty understanding and navigating the spaces around them. Current wayfinding technologies either focus solely on navigation or provide limited communication about the environment. Motivated by recent advances in visual-language grounding and semantic navigation, we propose DRAGON, a guiding robot powered by a dialogue system and the ability to associate the environment with natural language. By understanding the user's commands, DRAGON can guide the user to desired landmarks on the map, describe the environment, and answer questions from visual observations. Through effective use of dialogue, the robot can ground the user's free-form language to the environment and give the user semantic information through spoken language. We conduct a user study with blindfolded participants in an everyday indoor environment. Our results demonstrate that DRAGON communicates with the user smoothly, provides a good guiding experience, and connects users with their surrounding environment in an intuitive manner.
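To illustrate the open-vocabulary grounding idea described above, the sketch below matches a user's free-form request against landmark names on a map using text embeddings from an off-the-shelf visual-language model. This is a minimal assumption-laden example, not the authors' implementation: the CLIP checkpoint, the landmark list, and the clarification-on-low-score behavior are all illustrative choices.

```python
# Minimal sketch (assumptions, not the paper's code): ground a free-form
# landmark request against named map landmarks via CLIP text embeddings.
import torch
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # illustrative choice of VLM
model = CLIPModel.from_pretrained(MODEL_NAME)
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

# Hypothetical landmarks registered on the robot's map.
landmarks = ["elevator", "water fountain", "vending machines", "restroom", "main exit"]

def ground_request(utterance: str) -> tuple[str, float]:
    """Return the map landmark whose text embedding best matches the utterance."""
    inputs = processor(text=[utterance] + landmarks, return_tensors="pt", padding=True)
    with torch.no_grad():
        emb = model.get_text_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)   # normalize for cosine similarity
    sims = emb[0] @ emb[1:].T                     # utterance vs. each landmark
    best = int(sims.argmax())
    return landmarks[best], float(sims[best])

landmark, score = ground_request("take me somewhere I can get a drink")
print(landmark, score)  # a low score could trigger a clarifying dialogue turn instead
```

In a dialogue-driven system, a low similarity score would plausibly prompt a disambiguation question to the user rather than committing to a navigation goal.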
Submission Number: 4
