CLIP-Guided Reinforcement Learning for Open-Vocabulary Tasks

21 Sept 2023 (modified: 11 Feb 2024), Submitted to ICLR 2024
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: open-vocabulary, reinforcement learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Open-vocabulary ability is crucial for an agent that follows natural language instructions. In this paper, we develop an open-vocabulary agent through reinforcement learning. We use CLIP to segment the target object named in the language instruction from the image observation. The resulting confidence map replaces the text instruction as input to the agent's policy, grounding the natural language in visual information. Compared with the vast embedding space of natural language, the two-dimensional confidence map is a more accessible, unified representation for neural networks. When an instruction contains an unseen object, the agent converts the textual description into a confidence map it can interpret, which enables it to accomplish open-vocabulary tasks. We also introduce an intrinsic reward function based on the confidence map that guides the agent towards the target object more effectively. Single-task experiments show that this intrinsic reward significantly improves performance. Multi-task experiments on tasks outside the training set show that an agent given confidence maps as input has open-vocabulary capabilities.
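The two ideas in the abstract, a confidence map derived from image-text similarity and an intrinsic reward computed from that map, can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration, not the paper's method: the patch and text embeddings are toy arrays standing in for CLIP features, the map is a rescaled cosine similarity, and the reward is a hypothetical centre-weighted average of the map. The abstract does not state how either quantity is actually computed.

```python
import numpy as np

def confidence_map(patch_emb, text_emb):
    """Hypothetical confidence map: cosine similarity rescaled to [0, 1].

    patch_emb: (H, W, D) array of per-patch image embeddings (stand-in for CLIP).
    text_emb:  (D,) array embedding the target object's name (stand-in for CLIP).
    """
    p = patch_emb / np.linalg.norm(patch_emb, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    cos = p @ t                      # (H, W) cosine similarities in [-1, 1]
    return (cos + 1.0) / 2.0         # rescale to [0, 1]

def intrinsic_reward(conf, sigma=1.0):
    """Assumed reward: Gaussian centre-weighted mean of the confidence map.

    It is larger when high-confidence patches lie near the image centre,
    i.e. when the agent is facing the target. This form is an assumption.
    """
    h, w = conf.shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    kernel = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))
    return float((conf * kernel).sum() / kernel.sum())

def toy_patches(target_pos, text_emb, shape=(5, 5)):
    """Toy grid: one patch matches the text embedding, the rest are orthogonal."""
    grid = np.zeros(shape + (text_emb.shape[0],))
    grid[..., 1] = 1.0               # background patches point along axis 1
    grid[target_pos] = text_emb      # the target patch matches the text exactly
    return grid

text = np.array([1.0, 0.0, 0.0])     # toy "text embedding"
centre = toy_patches((2, 2), text)   # target in the middle of the view
corner = toy_patches((0, 0), text)   # target at the edge of the view

r_centre = intrinsic_reward(confidence_map(centre, text))
r_corner = intrinsic_reward(confidence_map(corner, text))
print(r_centre, r_corner)
```

In this toy setting the policy would receive the `(H, W)` map rather than the text itself, so an instruction naming a new object changes only the text embedding and not the input format the policy sees.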
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3362