Learning Diverse and Effective Policies with Non-Markovian Rewards

22 Sept 2022 (modified: 13 Feb 2023) · ICLR 2023 Conference Withdrawn Submission
Keywords: policy diversity, non-Markovian rewards, reinforcement learning
TL;DR: We propose a diversity matrix to quantify policy diversity and prove theoretically that if the diversity matrix is positive definite, policy diversity can be achieved without sacrificing effectiveness.
Abstract: Learning a set of diverse and high-quality policies is a difficult problem in reinforcement learning, since policy diversity must be achieved without compromising effectiveness. The problem becomes more challenging when rewards are non-Markovian, i.e., when they depend on the history of states and actions; such rewards are sparse and are returned only over long horizons. The sparse supervision signals and the non-Markovian nature of the rewards hinder the learning of policy embeddings, and thus the learning of diverse and high-quality policies. In this paper, we propose a diversity matrix to quantify policy diversity and theoretically prove that if the diversity matrix is positive definite, policy diversity can be achieved without sacrificing effectiveness. The diversity matrix is built from policy embeddings. To obtain high-quality embeddings, we adopt a transformer to capture the mutual dependencies between states and actions, and we design pseudo tasks to overcome reward sparsity. Experimental results show that our method achieves a set of policies with more effective diversity and better performance than several recently proposed baselines in a variety of non-Markovian and Markovian environments.
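The abstract does not specify how the diversity matrix is constructed from policy embeddings, so the following is only a minimal, hypothetical Python sketch of the general idea it describes: form a Gram-style matrix over policy embeddings and test it for positive definiteness. The RBF kernel, the bandwidth, and the random stand-in embeddings are illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch, NOT the paper's implementation: a kernel Gram matrix
# over policy embeddings as one plausible instantiation of a "diversity
# matrix", plus a positive-definiteness check via its smallest eigenvalue.
import numpy as np

def diversity_matrix(embeddings: np.ndarray, bandwidth: float = 1.0) -> np.ndarray:
    """Gram matrix over policy embeddings using an RBF kernel (an assumed
    choice): K[i, j] = exp(-||e_i - e_j||^2 / (2 * bandwidth^2))."""
    sq_dists = np.sum((embeddings[:, None, :] - embeddings[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def is_positive_definite(matrix: np.ndarray, tol: float = 1e-8) -> bool:
    """Check positive definiteness of a symmetric matrix by verifying that
    its smallest eigenvalue exceeds a small tolerance."""
    return bool(np.linalg.eigvalsh(matrix).min() > tol)

# Toy usage: 4 policies with 8-dimensional embeddings, used here as random
# stand-ins for the transformer-produced embeddings the abstract mentions.
rng = np.random.default_rng(0)
policy_embeddings = rng.normal(size=(4, 8))
D = diversity_matrix(policy_embeddings)
print("diversity matrix positive definite:", is_positive_definite(D))
```

Under an RBF kernel, the Gram matrix of distinct embeddings is positive definite, which matches the abstract's claim that positive definiteness certifies genuinely distinct (diverse) policies; collapsed or duplicated embeddings would make the matrix singular.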