Abstract: This work presents a novel approach to zero-shot visual object goal navigation that leverages the capabilities of visual Large Language Models (vLLMs) to find targets in unknown environments. Our system combines semantic mapping with vLLM-driven decision-making to direct robots toward target objects. The core of our approach lies in using the vLLM to generate a value map relating explored areas to the target object via cosine similarity against a fixed target prompt, incorporating both visual and semantic information from RGB-D observations. This value map, along with a constructed semantic map and extracted navigable frontier points, serves as historical information from which the vLLM selects the next frontier to explore. We evaluate our method separately on two single-floor scenes from the Habitat-Matterport 3D dataset and the Habitat Synthetic Scenes Dataset, using the Habitat simulator. Our experiments demonstrate that the proposed approach has the potential to explore efficiently, particularly when utilizing semantic information provided by the simulator. The results show the promise of our method for zero-shot navigation scenarios, provided the common challenge of semantic extraction is overcome. This work contributes to the growing field of language-driven exploration and demonstrates how advanced large language models can effectively tackle complex navigation tasks.
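The abstract does not give implementation details for the value map; as a rough illustration of the cosine-similarity scoring described above, the sketch below assumes per-cell visual embeddings and a target-prompt embedding are already available from the vision-language model's encoders. The function name, array shapes, and placeholder data are hypothetical, not the authors' implementation.

```python
import numpy as np

def cosine_value_map(region_embeddings: np.ndarray, prompt_embedding: np.ndarray) -> np.ndarray:
    """Score each explored map cell against the target-object prompt.

    region_embeddings: (H, W, D) visual embeddings for explored cells
        (assumed to come from the vision-language model's image encoder).
    prompt_embedding: (D,) embedding of the target-object prompt.
    Returns an (H, W) value map of cosine similarities in [-1, 1].
    """
    region_norm = region_embeddings / (
        np.linalg.norm(region_embeddings, axis=-1, keepdims=True) + 1e-8
    )
    prompt_norm = prompt_embedding / (np.linalg.norm(prompt_embedding) + 1e-8)
    return region_norm @ prompt_norm

# Toy usage with random placeholder embeddings; in practice both inputs
# would come from the vLLM, and the resulting map would help rank frontiers.
rng = np.random.default_rng(0)
value_map = cosine_value_map(rng.normal(size=(64, 64, 512)), rng.normal(size=512))
best_cell = np.unravel_index(np.argmax(value_map), value_map.shape)
```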