Keywords: Embodied Agents, Memory, Long Context VLMs
TL;DR: A scalable benchmark in Habitat testing agents on long-horizon embodied tasks requiring memory, contextual reasoning, and navigation.
Abstract: Vision-language models (VLMs) have recently demonstrated impressive performance in planning and control tasks, driving interest in their application to robotics. Yet their deployment in embodied settings remains limited by the challenge of incorporating long-term experience, which often spans multiple days and is represented by vast image collections. Current VLMs typically handle only a few hundred images at once, underscoring the need for more efficient mechanisms to manage long-term memory in embodied contexts. To meaningfully evaluate these models for long-horizon control, a benchmark must target scenarios where memory is essential. Existing long-video QA benchmarks neglect embodied challenges such as object manipulation and navigation, which require low-level skills and fine-grained reasoning over past interactions. Moreover, effective memory integration in embodied agents involves both recalling relevant historical information and executing actions based on that information, making it essential to study these aspects together. In this work, we introduce FindingDory, a new benchmark for long-horizon embodied tasks in the Habitat simulator. FindingDory evaluates memory-centric capabilities across 60 tasks requiring sustained engagement and contextual awareness within an environment. The tasks can also be procedurally extended into longer and more challenging versions, enabling scalable evaluation of memory and reasoning. We further present baselines that integrate state-of-the-art closed-source and fine-tuned open-source VLMs with low-level navigation policies, assess their performance on these memory-intensive tasks, and highlight key areas for improvement.
Primary Area: datasets and benchmarks
Submission Number: 21764