Match or Replay: Self-Imitating Proximal Policy

TMLR Paper 4044 Authors

24 Jan 2025 (modified: 31 Mar 2025) · Under review for TMLR · CC BY 4.0
Abstract: Reinforcement Learning (RL) agents often struggle with inefficient exploration, particularly in environments with sparse rewards, where traditional exploration strategies lead to slow learning and suboptimal performance. This inefficiency stems from unsystematic exploration, in which agents fail to exploit past successful experiences effectively, hindering both temporal credit assignment and exploration. To address this, we propose a self-imitating on-policy algorithm that enhances exploration by bootstrapping policy learning with past successful state-action transitions. To incorporate self-imitation, in dense-reward environments our method uses an optimal transport distance to encourage a state visitation distribution that matches the most rewarding past trajectory. In sparse-reward environments, we uniformly replay self-encountered successful trajectories to provide structured exploration. Experimental results across diverse environments, including MuJoCo for dense rewards and the partially observable 3D Animal-AI Olympics and multi-goal PointMaze for sparse rewards, demonstrate significant improvements in learning efficiency. Our approach achieves faster convergence and significantly higher success rates than state-of-the-art self-imitating RL baselines. These findings suggest that self-imitation is a promising strategy for improving exploration and can be extended to more complex RL tasks.
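The abstract describes two self-imitation ingredients: uniform replay of stored successful trajectories (sparse rewards) and an optimal transport distance between the agent's recent state visitation and its most rewarding past trajectory (dense rewards). Below is a minimal NumPy sketch of how such components could look; it is not the paper's implementation, and all names (`SuccessBuffer`, `sinkhorn_distance`), hyperparameters, and design details are illustrative assumptions.

```python
import numpy as np

class SuccessBuffer:
    """Hypothetical buffer of self-encountered successful trajectories."""

    def __init__(self, capacity=50):
        self.capacity = capacity
        self.trajectories = []  # each entry: (episode_return, states, actions)

    def add(self, states, actions, episode_return, success):
        # Only store trajectories the agent itself completed successfully.
        if not success:
            return
        self.trajectories.append(
            (episode_return, np.asarray(states), np.asarray(actions))
        )
        # Keep only the highest-return trajectories.
        self.trajectories.sort(key=lambda t: t[0], reverse=True)
        self.trajectories = self.trajectories[: self.capacity]

    def sample_uniform(self, rng):
        """Sparse-reward case: replay a stored successful trajectory uniformly."""
        idx = rng.integers(len(self.trajectories))
        _, states, actions = self.trajectories[idx]
        return states, actions

    def best_states(self):
        """Dense-reward case: states of the most rewarding trajectory seen so far."""
        return self.trajectories[0][1]


def sinkhorn_distance(x, y, reg=0.1, n_iters=100):
    """Entropic optimal transport distance between two state point clouds (rows)."""
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)  # pairwise costs
    a = np.full(len(x), 1.0 / len(x))  # uniform weights on current visitation
    b = np.full(len(y), 1.0 / len(y))  # uniform weights on best past trajectory
    K = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):  # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    transport_plan = u[:, None] * K * v[None, :]
    return float(np.sum(transport_plan * cost))
```

Under these assumptions, the negative Sinkhorn distance could serve as an auxiliary reward that pulls the policy's visitation toward the best past trajectory, while the uniformly sampled state-action pairs could feed an imitation term added to the on-policy (e.g., PPO) objective.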
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Dileep_Kalathil1
Submission Number: 4044
