RETHINK MAXIMUM STATE ENTROPY

ICLR 2025 Conference Submission 9139 Authors

27 Sept 2024 (modified: 19 Nov 2024), ICLR 2025 Conference Submission, CC BY 4.0
Keywords: Reinforcement Learning for Exploration, Maximum Entropy, Intrinsic Rewards
TL;DR: Simplest Maximum State Entropy Ever.
Abstract: In the absence of specific tasks or extrinsic reward signals, a key objective for an agent is the efficient exploration of its environment. A widely adopted strategy to achieve this is maximizing state entropy, which encourages the agent to explore the entire state space uniformly. Most existing approaches for maximum state entropy (MaxEnt) are rooted in two foundational methods, proposed by Hazan and by Liu \& Abbeel, respectively. However, a unified perspective on these methods is lacking within the community. In this paper, we analyze these two foundational methods within a unified framework and demonstrate that both share the same reward function when employing the $k$NN density estimator. We also show that the $\eta$-based policy sampling method proposed by Hazan is unnecessary, and that the primary distinction between the two lies in how frequently the locally stationary reward function is updated. Building on this analysis, we introduce MaxEnt-(V)eritas, which combines the most effective components of both methods: iteratively updating the reward function as defined by Liu \& Abbeel, and training the agent to convergence before each reward update, akin to the procedure used by Hazan. We prove that MaxEnt-V is an efficient $\varepsilon$-optimal algorithm for maximizing state entropy, where the tolerance $\varepsilon$ decreases as the number of iterations increases. Empirical validation in three MuJoCo environments shows that MaxEnt-Veritas significantly outperforms the two MaxEnt frameworks in terms of both state coverage and state entropy maximization, together with sound explanations for these results.
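To make the procedure described in the abstract concrete, below is a minimal, hedged sketch (not the authors' code) of the two ingredients it names: a $k$NN-based intrinsic reward of the kind used in MaxEnt methods, and an outer loop that holds the reward fixed while the policy is trained to convergence, then refreshes it from newly collected states. All function and parameter names (collect_states, train_until_convergence, k) are illustrative assumptions, not the paper's API.

# Minimal sketch, assuming a standard particle-based (kNN) entropy estimate:
# reward each state by the log distance to its k-th nearest neighbor, so states
# in sparsely visited regions receive larger rewards.
import numpy as np

def knn_intrinsic_reward(states, k=5):
    """Compute a kNN-based intrinsic reward for a batch of states of shape (N, d)."""
    states = np.asarray(states, dtype=np.float64)
    # Pairwise Euclidean distances between all states, shape (N, N).
    dists = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)            # ignore the distance of a state to itself
    kth = np.sort(dists, axis=1)[:, k - 1]     # distance to the k-th nearest neighbor
    return np.log(kth + 1e-8)                  # log distance, matching the entropy estimator

# Outer-loop structure described in the abstract (placeholders, not runnable as-is):
# for it in range(num_iterations):
#     states = collect_states(policy)                   # roll out the current policy
#     rewards = knn_intrinsic_reward(states, k=5)       # locally stationary reward
#     policy = train_until_convergence(policy, states, rewards)  # any RL algorithm, e.g. SAC/PPO

In this sketch the reward is recomputed only after the inner training loop converges, which is the combination the abstract attributes to MaxEnt-(V)eritas: the iteratively updated kNN reward of Liu \& Abbeel with the train-to-convergence schedule of Hazan.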
Supplementary Material: zip
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9139