Exploration in Policy Mirror Descent

27 Sept 2018 (modified: 05 May 2023) · ICLR 2019 Conference Withdrawn Submission
Abstract: Policy optimization is a core problem in reinforcement learning. In this paper, we investigate Reversed Entropy Policy Mirror Descent (REPMD), an online policy optimization strategy that improves exploration behavior while assuring monotonic progress on a principled objective. REPMD conducts a form of maximum entropy exploration within a mirror descent framework, but uses an alternative policy update based on a reversed KL projection. This modified formulation bypasses undesirable mode-seeking behavior and avoids premature convergence to sub-optimal policies, while still supporting strong theoretical properties such as guaranteed policy improvement. An experimental evaluation demonstrates that this approach significantly improves practical exploration and surpasses the empirical performance of state-of-the-art policy optimization methods on a set of benchmark tasks.
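For readers unfamiliar with the distinction invoked above, the two directions of a KL projection behave quite differently; the sketch below states the standard fact, not the paper's exact objective, and uses a generic target distribution q and optimized policy \pi as placeholder notation:

\[
\min_{\pi} \mathrm{KL}(\pi \,\|\, q) \;=\; \min_{\pi} \; \mathbb{E}_{a \sim \pi}\big[\log \pi(a) - \log q(a)\big]
\quad \text{(zero-forcing: } \pi \text{ tends to collapse onto a single mode of } q\text{)}
\]
\[
\min_{\pi} \mathrm{KL}(q \,\|\, \pi) \;=\; \min_{\pi} \; \mathbb{E}_{a \sim q}\big[-\log \pi(a)\big] + \mathrm{const}
\quad \text{(mass-covering: } \pi \text{ must place probability on every mode of } q\text{)}
\]

Avoiding the zero-forcing direction in the projection step is consistent with the abstract's claim that the reversed update sidesteps premature mode seeking.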
Keywords: Reinforcement Learning, Exploration, Policy Optimization