Keywords: Vision Language Navigation, Object Search
TL;DR: An open-vocabulary multi-object search framework that integrates VLMs, frontier-based exploration, and POMDP-based planning for efficient and robust search.
Abstract: Object search is a fundamental task for robots deployed in indoor building environments, yet challenges arise from observation instability, especially for open-vocabulary models. While foundation models (LLMs/VLMs) enable reasoning about object locations even without direct visibility, the ability to recover from failures and replan remains crucial. The Multi-Object Search (MOS) problem further increases complexity, requiring tracking of multiple objects and thorough exploration in novel environments, making observation uncertainty a significant obstacle. To address these challenges, we propose a framework that integrates VLM-based reasoning, frontier-based exploration, and Partially Observable Markov Decision Process (POMDP) planning to solve the MOS problem in novel environments. The VLM enhances search efficiency by inferring object-environment relationships, frontier-based exploration guides navigation in unknown spaces, and the POMDP models observation uncertainty, allowing recovery from failures in occluded and cluttered environments.
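As an illustration of how these three components could fit together, the sketch below shows a VLM-derived semantic prior over frontiers, a POMDP-style Bayesian belief update under a noisy detector, and a greedy frontier selection that trades off belief mass against travel cost. This is a minimal sketch under assumed interfaces: the function names (`semantic_prior`, `belief_update`, `select_frontier`), the detector rates, and the utility function are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only: function names, detector rates, and the utility
# function are assumptions for exposition, not the authors' implementation.
import numpy as np

def semantic_prior(frontiers, query_object):
    """Stand-in for a VLM query: score how likely `query_object` is near each
    frontier. Returns uniform scores here; a real system would prompt a VLM
    with observed images or semantic labels of each region."""
    return np.ones(len(frontiers)) / len(frontiers)

def belief_update(belief, frontier_idx, detected, p_tp=0.8, p_fp=0.1):
    """Bayesian belief update after observing at one frontier, modeling
    detector noise (true-positive / false-positive rates) as in a POMDP."""
    likelihood = np.full(len(belief), p_fp if detected else 1.0 - p_fp)
    likelihood[frontier_idx] = p_tp if detected else 1.0 - p_tp
    posterior = belief * likelihood
    return posterior / posterior.sum()

def select_frontier(belief, frontiers, robot_xy, cost_weight=0.1):
    """Greedy one-step lookahead: trade off belief mass against travel cost."""
    costs = np.linalg.norm(np.asarray(frontiers) - np.asarray(robot_xy), axis=1)
    utility = belief - cost_weight * costs
    return int(np.argmax(utility))

if __name__ == "__main__":
    frontiers = [(2.0, 1.0), (5.0, 4.0), (1.0, 6.0)]  # candidate unexplored regions
    belief = semantic_prior(frontiers, "coffee mug")   # VLM-informed prior
    robot_xy = (0.0, 0.0)
    for step in range(3):
        idx = select_frontier(belief, frontiers, robot_xy)
        detected = False                               # simulated missed detection
        belief = belief_update(belief, idx, detected)  # recover and replan
        robot_xy = frontiers[idx]
        print(f"step {step}: visited frontier {idx}, belief {belief.round(3)}")
```

In this toy loop, a missed detection lowers the belief at the visited frontier rather than terminating the search, which is the recovery-and-replanning behavior the abstract attributes to the POMDP component.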
Submission Number: 23