Subgoal Proposition Using a Vision-Language Model

Published: 21 Oct 2023, Last Modified: 25 Oct 2023. LangRob @ CoRL 2023 Poster.
Keywords: Subgoal Proposition, Language Grounding, Vision-Language Model
TL;DR: We investigate how the Flamingo architecture, a vision-language model, can improve the efficiency of solving long-horizon instruction-following tasks in partially observable environments by proposing promising subgoals in an HRL setting.
Abstract: Recent advances in large language models (LLMs) have inspired research on their potential for robots in real-world tasks. This study investigates whether the architecture of the vision-language model (VLM) Flamingo can help ground the knowledge of a pretrained causal LLM in an agent's experience when following instructions in long-horizon tasks within partially observable environments. To this end, we propose a VLM-based hierarchical reinforcement learning (HRL) agent that uses Flamingo's Perceiver Resampler and cross-attention mechanism with a causal LLM to capture mission status and suggest promising subgoals. We assess the impact of two design factors, the training mode of the pretrained LLM and the history representation, by evaluating three training modes (*RandomInit*, *TuneAll*, and *FrozenAll*) and two history representations (*Full History* and *Abstract History*). Experiments conducted on the BabyAI platform demonstrate that: 1) the VLM-HRL agent outperforms the baseline agent; 2) LLM pretraining is unhelpful for the VLM-HRL agent in environments with less-natural instructions, and finetuning serves to undo pretraining; 3) history abstraction enhances learning efficiency and stability.
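The two Flamingo components the abstract names, the Perceiver Resampler and the gated cross-attention into a (possibly frozen) causal LLM, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the dimensions, module names, and the zero-initialised tanh gate follow the published Flamingo design, and all sizes here are hypothetical.

```python
import torch
import torch.nn as nn


class PerceiverResampler(nn.Module):
    """Compress a variable-length sequence of visual features into a
    fixed set of latent tokens via cross-attention (Flamingo-style)."""

    def __init__(self, dim=64, num_latents=8, num_heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)
        )

    def forward(self, visual_feats):
        # visual_feats: (batch, T, dim), T may vary per episode history
        b = visual_feats.size(0)
        q = self.latents.unsqueeze(0).expand(b, -1, -1)
        out, _ = self.attn(q, visual_feats, visual_feats)
        return out + self.ff(out)  # (batch, num_latents, dim)


class GatedCrossAttentionBlock(nn.Module):
    """Let LLM token states attend to the resampled visual latents.
    The tanh gate starts at zero, so at initialisation the block is an
    identity map and the pretrained LLM's behaviour is preserved."""

    def __init__(self, dim=64, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_states, visual_latents):
        attended, _ = self.attn(text_states, visual_latents, visual_latents)
        return text_states + torch.tanh(self.gate) * attended
```

In an HRL agent along the lines described, blocks like these would be interleaved with the LLM's layers; the LLM's final hidden state could then be projected onto a subgoal vocabulary to propose the next subgoal, with the LLM weights trained, finetuned, or frozen depending on the training mode (*RandomInit*, *TuneAll*, *FrozenAll*).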
Submission Number: 30