Abstract: Reinforcement Learning (RL) agents often struggle with inefficient exploration, particularly in environments with sparse rewards, where traditional exploration strategies lead to slow learning and suboptimal performance. This inefficiency stems from unsystematic exploration: agents fail to exploit past successful experiences, which hinders both temporal credit assignment and further exploration. To address this, we propose a self-imitating on-policy algorithm that enhances exploration by bootstrapping policy learning with past successful state-action transitions. In dense-reward environments, our method incorporates self-imitation through an optimal transport distance that encourages the policy's state-visitation distribution to match that of the most rewarding past trajectory. In sparse-reward environments, we uniformly replay self-encountered successful trajectories to provide structured exploration. Experimental results across diverse environments—including MuJoCo for dense rewards and the partially observable 3D Animal-AI Olympics and multi-goal PointMaze for sparse rewards—demonstrate significant improvements in learning efficiency. Our approach achieves faster convergence and significantly higher success rates than state-of-the-art self-imitating RL baselines. These findings suggest that self-imitation is a promising strategy for improving exploration and can be extended to more complex RL tasks.
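The sketch below illustrates one way such an optimal-transport self-imitation bonus could be computed; it is a minimal, self-contained approximation (entropic-regularized Sinkhorn iterations over state point clouds), not the authors' implementation, and all function names, the cost metric, and hyperparameters are illustrative assumptions.

```python
import numpy as np

def sinkhorn_ot_distance(x, y, reg=0.05, n_iters=100):
    """Approximate OT distance between two sets of states via Sinkhorn iterations.

    x: (n, d) states from the agent's current trajectory
    y: (m, d) states from the most rewarding past trajectory
    """
    # Squared-Euclidean cost between every pair of states, normalized for stability.
    cost = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    cost = cost / max(cost.max(), 1e-12)
    # Uniform marginals over the two trajectories.
    a = np.full(len(x), 1.0 / len(x))
    b = np.full(len(y), 1.0 / len(y))
    # Standard Sinkhorn scaling updates on the Gibbs kernel.
    K = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    transport_plan = u[:, None] * K * v[None, :]
    return float(np.sum(transport_plan * cost))

def self_imitation_bonus(current_states, best_states, scale=1.0):
    """Shaping bonus: smaller OT distance to the best past trajectory => larger bonus."""
    return -scale * sinkhorn_ot_distance(current_states, best_states)

# Hypothetical usage: reward the on-policy agent for visiting states close
# (in the OT sense) to those of the highest-return trajectory seen so far.
rng = np.random.default_rng(0)
current = rng.normal(size=(50, 8))   # 50 states with 8-dim observations
best = rng.normal(size=(60, 8))
print(self_imitation_bonus(current, best))
```

In a sparse-reward setting, the same stored best trajectories could instead be replayed uniformly as additional imitation targets, as the abstract describes, without the OT-based shaping term.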
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Dileep_Kalathil1
Submission Number: 4044