Explore until Confident: Efficient Exploration for Embodied Question Answering

Published: 05 Apr 2024, Last Modified: 26 Apr 2024 · VLMNM 2024 · CC BY 4.0
Keywords: Embodied Question Answering, Vision Language Model, Conformal prediction
TL;DR: We propose a new framework that combines a VLM's commonsense reasoning with rigorous uncertainty quantification to enable efficient exploration in Embodied Question Answering tasks.
Abstract: We consider the problem of Embodied Question Answering (EQA), where a robot needs to actively explore an environment to gather information until it is confident about the answer to a question. We leverage the strong semantic reasoning capabilities of large vision-language models (VLMs) to efficiently explore and answer such questions. We first build a semantic map of the scene based on depth information and via visual prompting of a VLM, leveraging its vast knowledge of relevant regions of the scene for exploration. Next, we use conformal prediction to calibrate the VLM's question-answering confidence, allowing the robot to know when to stop exploring, which leads to a more calibrated and efficient exploration strategy. To test our framework in simulation, we also contribute a new EQA dataset with diverse scenes built upon the Habitat-Matterport 3D Research Dataset (HM3D). Both simulated and real robot experiments show that our proposed approach improves performance and efficiency over baselines.
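The abstract describes calibrating the VLM's answering confidence with conformal prediction and stopping exploration once the robot is confident. As a rough illustration only (not the authors' released code), below is a minimal sketch of how split conformal prediction could serve as such a stopping rule, assuming the VLM exposes per-answer likelihoods; the function names (`calibrate_threshold`, `should_stop`) and the coverage level are hypothetical choices for this sketch.

```python
import numpy as np

def calibrate_threshold(cal_scores, alpha=0.1):
    """Split conformal calibration.

    cal_scores: nonconformity scores on a held-out calibration set of
    answered questions (e.g. 1 - VLM probability of the true answer).
    Returns the quantile threshold q_hat targeting ~(1 - alpha) coverage.
    """
    n = len(cal_scores)
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(cal_scores, q_level, method="higher")

def prediction_set(answer_probs, q_hat):
    """All candidate answers whose nonconformity score falls below q_hat."""
    return {a for a, p in answer_probs.items() if 1.0 - p <= q_hat}

def should_stop(answer_probs, q_hat):
    """Stop exploring once the conformal prediction set collapses to one answer."""
    return len(prediction_set(answer_probs, q_hat)) == 1

# Example: calibrate once offline, then query the stopping rule at each step.
rng = np.random.default_rng(0)
cal_scores = 1.0 - rng.beta(8, 2, size=200)   # stand-in for 1 - p(true answer)
q_hat = calibrate_threshold(cal_scores, alpha=0.1)

step_probs = {"A": 0.55, "B": 0.30, "C": 0.10, "D": 0.05}  # VLM answer likelihoods
print(should_stop(step_probs, q_hat))
```

In this sketch, exploration continues while more than one answer remains in the prediction set and halts as soon as the set is a singleton; the calibration step is what gives the stopping decision a statistical coverage interpretation rather than relying on raw, possibly miscalibrated VLM confidences.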
Supplementary Material: zip
Submission Number: 41