TL;DR: We introduce a method that uses the visual-text alignment score as a reward for better generalization.
Abstract: Recent work has shown that incorporating pre-trained multimodal representations can enhance the ability of an instruction-following agent to generalize to unseen situations. Yet training such agents often requires a dataset of diverse demonstrations, which may not be available for target domains and can be costly to collect. In this paper, we instead propose to utilize the knowledge captured within large vision-language models to improve the generalization capability of control agents. To this end, we present Multimodal Reward Decision Transformer (MRDT), a simple yet effective method that uses the visual-text alignment score as a reward. This reward, which adapts based on the agent's progress towards the text-specified goals, is used to train a return-conditioned policy that guides the agent towards the desired goals. We also introduce a fine-tuning scheme that adapts pre-trained multimodal models with in-domain data to improve the quality of the rewards. Our experiments demonstrate that MRDT significantly improves generalization performance in test environments with unseen goals. Moreover, we introduce new metrics for evaluating the quality of multimodal rewards and show that generalization performance increases as reward quality improves.
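To make the reward construction concrete, below is a minimal sketch of a visual-text alignment reward, assuming a CLIP-style vision-language model (here the Hugging Face `openai/clip-vit-base-patch32` checkpoint). The function name `multimodal_reward` and the progress-based shaping at the end are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch: visual-text alignment score as a per-step reward (assumed CLIP backbone).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

@torch.no_grad()
def multimodal_reward(frame: Image.Image, goal_text: str) -> float:
    """Cosine similarity between the current observation and the text-specified goal."""
    inputs = processor(text=[goal_text], images=frame, return_tensors="pt", padding=True)
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).item()

# One way to make the reward track progress toward the goal (an assumption about
# how "adapts based on progress" could be operationalized) is to reward the
# increase in alignment between consecutive frames:
#   r_t = multimodal_reward(obs_t, goal) - multimodal_reward(obs_prev, goal)
```

Such per-step rewards can then be summed into returns and used to condition a Decision Transformer-style policy, as the abstract describes.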
Keywords: Reinforcement Learning, Multimodal Representation, Imitation Learning
Submission Number: 58