Reflective Policy Optimization

23 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Reinforcement Learning; on-policy
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: On-policy reinforcement learning methods, such as Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), often require a large amount of data to be collected at each update, giving rise to sample inefficiency. This paper introduces Reflective Policy Optimization (RPO), a novel extension of on-policy methods. RPO's central idea is to amalgamate prior and subsequent state-action information from trajectory data when optimizing the current policy. This allows the agent to reflect on and, to a certain extent, modify its actions in the current state. Furthermore, our theoretical analysis shows that the proposed method not only preserves the crucial property of monotonic policy improvement but also contracts the solution space of the optimized policy, thereby expediting training. We empirically demonstrate the feasibility and efficacy of our approach on reinforcement learning benchmarks, where it achieves superior sample efficiency.
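
The abstract does not specify the exact form of the RPO objective. As a minimal sketch of one plausible reading, the snippet below extends the standard PPO clipped surrogate with a term that couples each transition with its successor via a joint importance ratio, so that the update at a state also incorporates information from the subsequent state-action pair. All names and hyperparameters here (e.g. `reflective_surrogate`, `clip_eps`) are illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch only: the submission's code is in the supplementary zip,
# and the precise RPO loss is not stated in the abstract.
import torch


def ppo_clip_term(ratio, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for a batch of transitions."""
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return torch.min(unclipped, clipped)


def reflective_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """
    Illustrative RPO-style loss over one trajectory segment.

    logp_new, logp_old: log pi(a_t | s_t) under the new / behavior policy, shape [T]
    advantages: advantage estimates A(s_t, a_t), shape [T]
    """
    ratios = torch.exp(logp_new - logp_old)  # per-transition ratios r_t
    # Standard PPO term for each transition.
    base = ppo_clip_term(ratios, advantages, clip_eps)
    # "Reflective" term: the joint ratio r_t * r_{t+1} weights the successor's
    # advantage, letting the update at s_t use subsequent state-action info.
    joint = ratios[:-1] * ratios[1:]
    refl = ppo_clip_term(joint, advantages[1:], clip_eps)
    return -(base.mean() + refl.mean())  # negated for gradient descent


if __name__ == "__main__":
    # Toy demonstration with random values.
    T = 8
    logp_new = torch.randn(T, requires_grad=True)
    loss = reflective_surrogate(logp_new, torch.randn(T), torch.randn(T))
    loss.backward()
    print(float(loss))
```

Clipping the joint ratio, rather than each factor separately, is one way such a coupled term could retain a PPO-like trust region while restricting the feasible policy set more tightly, which would be consistent with the abstract's claim that RPO contracts the solution space of the optimized policy.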
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6789