SACrificing Intuition: Kullback-Leibler Regularized Actor-Critic

TMLR Paper5843 Authors

08 Sept 2025 (modified: 17 Sept 2025), Under review for TMLR, CC BY 4.0
Abstract: One of the most popular algorithms in reinforcement learning is Soft Actor-Critic (SAC), as it promises to elegantly incorporate exploration into the optimization process. We revisit SAC through the lens of constrained optimization and develop Kullback-Leibler Actor-Critic (KLAC), a principled extension of SAC that replaces SAC's heuristic entropy bonus with a Kullback-Leibler regulariser against an arbitrary reference policy. We contrast KLAC with SAC and demonstrate, both analytically and with a concrete counterexample, that injecting the entropy term directly into the reward, as implemented in SAC, violates the convexity assumptions of the dual proof of near-optimality and can render the learned policy arbitrarily sub-optimal, no matter how small the temperature is chosen. This understanding reveals a fundamental systemic flaw in SAC, especially in sparse-reward environments. To retain the empirical exploration benefits without sacrificing theoretical soundness, we introduce a fixed uniform reward bias that captures the intrinsic motivation to "stay alive". Additionally, we propose a Kullback-Leibler annealing schedule that unifies discrete and continuous action spaces by mapping an intuitive probability of exploitation to a closed-form entropy or Kullback-Leibler target. Together, these contributions yield an algorithm that at least matches the sample efficiency and performance of SAC, as demonstrated on MuJoCo and MinAtar benchmarks, while enjoying provable near-optimality, interpretable hyperparameters, and a theoretically grounded exploration mechanism. We provide code to reproduce all plots in the paper.
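To make the relation between a Kullback-Leibler regulariser and an entropy bonus concrete, the following is a minimal sketch for a single state with a small discrete action set. It is not the paper's KLAC implementation. It only uses the standard fact that the maximiser of E_pi[Q] - temperature * KL(pi || reference) is proportional to reference * exp(Q / temperature), and that for a uniform reference the KL term equals log|A| minus the entropy. The function names, Q-values, and temperature are hypothetical and chosen purely for illustration.

```python
import numpy as np


def kl_regularized_policy(q_values, reference_policy, temperature):
    """Maximiser of E_pi[Q] - temperature * KL(pi || reference_policy).

    Assumes a discrete action set and a strictly positive reference policy.
    """
    q = np.asarray(q_values, dtype=float)
    ref = np.asarray(reference_policy, dtype=float)
    logits = np.log(ref) + q / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with strictly positive entries."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))


def entropy(p):
    """Shannon entropy (in nats) of a strictly positive discrete distribution."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p)))


# Hypothetical Q-values for three actions in one state.
q_values = np.array([1.0, 0.5, -0.2])
temperature = 0.5
uniform = np.full(3, 1.0 / 3.0)

# With a uniform reference, the KL-regularised policy is the usual softmax.
pi_uniform = kl_regularized_policy(q_values, uniform, temperature)

# With a non-uniform reference, probability mass is pulled toward the reference.
reference = np.array([0.1, 0.1, 0.8])
pi_reference = kl_regularized_policy(q_values, reference, temperature)

print("uniform reference:    ", pi_uniform)
print("non-uniform reference:", pi_reference)
print("KL to uniform:", kl_divergence(pi_uniform, uniform))
print("log|A| - H:   ", np.log(3) - entropy(pi_uniform))
```

In this sketch the uniform reference corresponds to the entropy-regularised special case, while any other reference policy changes where the regulariser pulls the solution. How the regulariser enters the critic and the reward, which is the point of the abstract's argument, is not modelled here.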
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Reza_Babanezhad_Harikandeh1
Submission Number: 5843