Abstract: Reinforcement Learning (RL) agents often struggle with inefficient exploration, particularly in environments with sparse rewards, where traditional exploration strategies lead to slow learning and suboptimal performance. This inefficiency stems from unsystematic exploration: agents fail to exploit past successful experiences, which hinders both temporal credit assignment and further exploration. To address this, we propose a self-imitating on-policy algorithm that enhances exploration by bootstrapping policy learning with past successful state-action transitions. In dense-reward environments, our method incorporates self-imitation through an optimal transport distance that encourages the policy's state-visitation distribution to match that of the most rewarding past trajectory. In sparse-reward environments, we uniformly replay self-encountered successful trajectories to provide structured exploration. Experimental results across diverse environments—including MuJoCo for dense rewards and the partially observable 3D Animal-AI Olympics and multi-goal PointMaze for sparse rewards—demonstrate significant improvements in learning efficiency. Our approach achieves faster convergence and significantly higher success rates than state-of-the-art self-imitating RL baselines. These findings suggest that self-imitation is a promising strategy for improving exploration and can be extended to more complex RL tasks.
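The sketch below illustrates one way such an optimal-transport self-imitation bonus could be computed; it is a minimal, self-contained approximation (entropic-regularized Sinkhorn iterations over state point clouds), not the authors' implementation, and all function names, the cost metric, and hyperparameters are illustrative assumptions.

```python
import numpy as np

def sinkhorn_ot_distance(x, y, reg=0.05, n_iters=100):
    """Approximate OT distance between two sets of states via Sinkhorn iterations.

    x: (n, d) states from the agent's current trajectory
    y: (m, d) states from the most rewarding past trajectory
    """
    # Squared-Euclidean cost between every pair of states, normalized for stability.
    cost = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    cost = cost / max(cost.max(), 1e-12)
    # Uniform marginals over the two trajectories.
    a = np.full(len(x), 1.0 / len(x))
    b = np.full(len(y), 1.0 / len(y))
    # Standard Sinkhorn scaling updates on the Gibbs kernel.
    K = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    transport_plan = u[:, None] * K * v[None, :]
    return float(np.sum(transport_plan * cost))

def self_imitation_bonus(current_states, best_states, scale=1.0):
    """Shaping bonus: smaller OT distance to the best past trajectory => larger bonus."""
    return -scale * sinkhorn_ot_distance(current_states, best_states)

# Hypothetical usage: reward the on-policy agent for visiting states close
# (in the OT sense) to those of the highest-return trajectory seen so far.
rng = np.random.default_rng(0)
current = rng.normal(size=(50, 8))   # 50 states with 8-dim observations
best = rng.normal(size=(60, 8))
print(self_imitation_bonus(current, best))
```

In a sparse-reward setting, the same stored best trajectories could instead be replayed uniformly as additional imitation targets, as the abstract describes, without the OT-based shaping term.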
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Dileep_Kalathil1
Submission Number: 4044