Memory-Maze: Scenario Driven Benchmark and Visual Language Navigation Model for Guiding Blind People
Workshop Statement: This paper seeks to bring attention to human understanding of route instructions in visual language navigation (VLN), particularly in scenarios where the robot seeks route instructions from humans in the real world. In traditional VLN settings, route instructions are annotated by humans with perfect knowledge of the routes, as the annotators have the luxury of experiencing the routes repeatedly to refresh their memories. Additionally, traditional route instructions are shorter and are mostly limited to smaller, simpler spaces such as houses or offices. By collecting route instruction data from the real world, we discovered that memory-based instructions often contain ambiguities and errors. Humans may be able to detect these errors in the instructions, whereas existing leading VLN models fail to do so. Existing VLN models also struggle to follow long route instructions in large public spaces, because they lack a recovery mechanism should they fail while following the routes.
Our paper contributes to the workshop's theme by adding discussion on the following topics. **AI alignment**: We observed a significant difference between VLN route instruction data obtained in the traditional idealized setting and those obtained in the real-world memory-based setting. Future VLN models need to address this gap. **Data accessibility**: We propose a new paradigm for collecting VLN route instruction data that is more aligned with human interpretations. **Applications in human-robot interaction**: We base our studies on the real-world scenario in which a blind person entering an area they have never visited before uses a navigation robot to obtain route instructions from passersby and proceed toward a destination.
Keywords: Vision-Based Navigation, Performance Evaluation and Benchmarking
TL;DR: We present our benchmark Memory-Maze, which contains novel, realistic route instruction data collected from human memory, and show that existing VLN models perform sub-optimally on it.
Abstract: Visual Language Navigation (VLN) powered robots have the potential to guide blind people by understanding route instructions provided by sighted passersby. This capability allows robots to operate in environments often unknown a priori. Existing VLN models are insufficient for the scenario of navigation guidance for blind people, as they must understand route instructions recalled from human memory, which frequently contain stutters, errors, and omitted details, as opposed to instructions obtained by thinking aloud, as in the R2R dataset. However, existing benchmarks do not contain instructions obtained from human memory in natural environments. To this end, we present our benchmark, Memory-Maze, which simulates the scenario of seeking route instructions for guiding blind people. Our benchmark contains a maze-like structured virtual environment and novel route instruction data from human memory. Our analysis demonstrates that instruction data collected from memory were longer and contained more varied wording. We further propose a VLN model better equipped to handle this scenario by leveraging Large Language Models (LLMs) and show that existing state-of-the-art models perform suboptimally on our benchmark.
Submission Number: 20