In our previous TMLR submission, the action editor and reviewers acknowledged our work's contribution to the RL community, our motivation for improving the discrete SAC algorithm, the comprehensive and promising experimental results, and the clarity of our writing. However, they raised concerns that our analysis of the failure modes of discrete SAC lacked sufficient support, and they requested further justification for our choice of the entropy penalty.
In this submission, we have thoroughly addressed these concerns. Through experiments in multiple environments and an analysis of both the game environments and the SAC update mechanism, we identify a more fundamental cause of training instability: shifts in the Q-function distribution induced by deceptive rewards, which in turn lead to abrupt changes in policy entropy. We have also added discussions and experiments demonstrating why the entropy penalty is necessary rather than alternatives such as a KL penalty.
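For reviewers who would like a concrete picture of what an entropy penalty on the policy objective could look like, the following is a minimal, hypothetical sketch of a discrete SAC policy loss augmented with such a term. It is illustrative only and not necessarily the exact formulation used in the paper; the penalty form (quadratic deviation from a target entropy) and the names `beta` and `target_entropy` are assumptions introduced here for clarity.

```python
# Hypothetical sketch: discrete SAC policy loss with an added entropy penalty.
# The penalty form and hyperparameters (beta, target_entropy) are illustrative
# assumptions, not the paper's exact formulation.
import torch
import torch.nn.functional as F


def policy_loss_with_entropy_penalty(logits, q_values, alpha, beta, target_entropy):
    """Compute a discrete SAC policy loss plus a penalty that damps
    abrupt entropy shifts by pulling policy entropy toward a target.

    logits:         [batch, n_actions] policy logits
    q_values:       [batch, n_actions] critic estimates Q(s, a)
    alpha:          SAC temperature
    beta:           penalty coefficient (assumed hyperparameter)
    target_entropy: desired policy entropy (assumed hyperparameter)
    """
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)

    # Standard discrete SAC policy objective: E_{a~pi}[alpha * log pi(a|s) - Q(s, a)]
    sac_term = (probs * (alpha * log_probs - q_values)).sum(dim=-1)

    # Entropy penalty: quadratic deviation of policy entropy from a target,
    # one possible way to discourage sudden entropy collapse or explosion.
    entropy = -(probs * log_probs).sum(dim=-1)
    penalty = beta * (entropy - target_entropy).pow(2)

    return (sac_term + penalty).mean()
```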
Changes for this round of review are highlighted in orange. We believe these modifications substantially improve the persuasiveness and quality of the paper, and that it now meets the standards of TMLR.