Reward-free Policy Optimization with World Models

Marc Höftmann; Jan Robine; Stefan Harmeling

27 Sept 2024 (modified: 25 Nov 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0

Keywords: Reward-free, Goal-conditioned, World Models, Planning, AI Safety

TL;DR: This method learns policies without rewards by planning.

Abstract: As AI capabilities advance rapidly, the development of safe and value-aligned algorithms is not keeping pace, raising concerns about autonomous systems. For example, maximizing expected return in reinforcement learning can lead to unintended and potentially harmful consequences. This work introduces Reward-free Policy Optimization (RFPO), a method that prioritizes goal-oriented policy learning over reward maximization by eliminating rewards as the agent's learning signal. Our approach learns a world model that simulates backward in time, uses it to construct a directed graph for planning, and finally learns a goal-conditioned policy from the graph. The algorithm has two requirements: (1) the goal has to be defined, and (2) the agent needs sufficient world knowledge to plan. The method removes the risks associated with reward hacking and discourages unintended behaviors by allowing for human oversight. Additionally, it provides a framework for humans to build transparent, high-level algorithms on top of the learned low-level policies. We demonstrate the effectiveness of RFPO on maze environments with pixel observations, where the agent successfully reaches arbitrarily selected goals and follows human-designed algorithms. In conclusion, RFPO enables agents to learn policies without rewards and provides a framework for creating high-level behaviors.
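
The three stages named in the abstract (backward simulation, graph construction for planning, goal-conditioned policy) can be illustrated with a minimal toy sketch. This is not the authors' implementation: the maze layout, the hand-coded `backward_model` (standing in for a learned world model over pixel observations), the tabular policy, and all function names are assumptions made purely for illustration.

```python
from collections import deque

# Hypothetical toy maze: '#' is a wall, '.' is a free cell.
MAZE = ["#####",
        "#...#",
        "#.#.#",
        "#...#",
        "#####"]

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def is_free(cell):
    r, c = cell
    return MAZE[r][c] != "#"


def backward_model(state):
    """Stand-in for a learned backward world model (assumption).

    Yields (predecessor, action) pairs such that taking `action`
    in `predecessor` leads to `state`.
    """
    r, c = state
    for action, (dr, dc) in ACTIONS.items():
        prev = (r - dr, c - dc)
        if is_free(prev):
            yield prev, action


def build_policy(goal):
    """Expand backward from the goal, breadth-first.

    The edges discovered form a directed graph pointing toward the goal.
    Because the search is breadth-first, the action recorded for each
    state lies on a shortest path to the goal. No reward is used.
    """
    policy = {}
    visited = {goal}
    queue = deque([goal])
    while queue:
        state = queue.popleft()
        for prev, action in backward_model(state):
            if prev not in visited:
                visited.add(prev)
                policy[prev] = action
                queue.append(prev)
    return policy


def rollout(start, goal, max_steps=50):
    """Follow the goal-conditioned (here: tabular) policy from start to goal."""
    policy = build_policy(goal)
    state, path = start, [start]
    while state != goal and len(path) <= max_steps:
        dr, dc = ACTIONS[policy[state]]
        state = (state[0] + dr, state[1] + dc)
        path.append(state)
    return path


if __name__ == "__main__":
    print(rollout((1, 1), (3, 3)))
```

In this sketch the only inputs are a goal and a model of the environment's dynamics, mirroring the two requirements stated in the abstract. In the setting the abstract describes, the lookup table would presumably be replaced by a learned goal-conditioned policy trained on the graph.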

Primary Area: reinforcement learning

Submission Number: 9610
