Training Open-ended Policies to follow Video-prompt Instructions with Reinforcement Learning

Kaichen He; Bowei Zhang; Zihao Wang; Shaofei Cai; QIANG FU; Haobo Fu; Anji Liu; Yitao Liang

28 Sept 2024 (modified: 15 Nov 2024), ICLR 2025 Conference Withdrawn Submission, CC BY 4.0

Keywords: online reinforcement learning, open-ended environment, pretrained video-conditioned policy

Abstract: In recent years, online reinforcement learning (RL) methods such as PPO have played a prominent role in influential works such as InstructGPT. Unlike their success in the language domain, however, online RL methods often struggle to generalize to untrained tasks in open-world environments such as Minecraft, due to issues like overfitting. This has become a significant obstacle to using online methods to build a generalist agent. In this work, we observe the modality difference between natural language environments and embodied environments such as Minecraft, which motivates us to use video instructions instead of text instructions to improve the model's understanding of the relationship between the environment and the instructions. We also introduce a new attention layer into the base model's encoder-decoder architecture to establish a dual-path (semantic and visual) information interaction channel, further strengthening this generalization capability. After training on a small set of tasks, our model demonstrates strong zero-shot generalization to new tasks, outperforming almost all other models in the Minecraft environment on our benchmark. Our approach takes a solid and important step toward unlocking the potential of online RL for building generalist agents.

Primary Area: reinforcement learning

Submission Number: 14047
