Keywords: Policy Mirror Descent, Regularization, Reinforcement Learning
Abstract: Policy Mirror Descent (PMD) has emerged as a unifying framework in reinforcement learning (RL) by linking policy gradient methods with a first-order optimization method known as mirror descent. At its core, PMD incorporates two key regularization components: (i) a distance term that enforces a trust region for stable policy updates and (ii) an MDP regularizer that augments the reward function to promote structure and robustness. While PMD has been extensively studied in theory, empirical investigations remain scarce. This work provides a large-scale empirical analysis of the interplay between these two regularization techniques, running over 500k training seeds on small RL environments. Our results demonstrate that, although the two regularizers can partially substitute each other, their precise combination is critical for achieving robust performance. These findings highlight the potential for advancing research on more robust algorithms in RL.
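For reference, a standard formulation of the regularized PMD update that features both components described in the abstract reads (this is the commonly used generic form; the paper's exact notation and regularizer choices may differ):
$$
\pi_{k+1}(\cdot \mid s) \in \arg\max_{p \in \Delta(\mathcal{A})} \Big\{ \eta_k \,\big\langle Q^{\pi_k}_\tau(s,\cdot),\, p \big\rangle \;-\; \eta_k \tau\, h(p) \;-\; D_h\big(p,\ \pi_k(\cdot \mid s)\big) \Big\},
$$
where $h$ is a convex MDP regularizer (e.g., negative entropy) with temperature $\tau$, $D_h$ is the induced Bregman divergence acting as the distance/trust-region term, and $\eta_k$ is the step size. The divergence $D_h$ corresponds to component (i) and the term $\tau\, h$ to component (ii) in the abstract.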
Confirmation: I understand that authors of each paper submitted to EWRL may be asked to review 2-3 other submissions to EWRL.
Serve As Reviewer: ~Jan_Felix_Kleuker1
Track: Fast Track: published work
Publication Link: kleukerjf@liacs.leidenuniv.nl
Submission Number: 105