In our previous TMLR submission, the action editor and reviewers acknowledged our work's contribution to the RL community, our motivation for improving the discrete SAC algorithm, the comprehensive and promising experimental results, and the clarity of our writing. However, they raised concerns that our analysis of the failure modes of discrete SAC lacked sufficient support, and they requested further justification for our choice of the entropy penalty.
In this submission, we have thoroughly addressed these concerns. Through experiments in multiple environments and an analysis of both the game environments and the SAC update mechanism, we identify a more fundamental cause of training instability: shifts in the Q-function distribution induced by deceptive rewards, which in turn lead to abrupt changes in policy entropy. We have also added discussions and experiments demonstrating why the entropy penalty is necessary rather than alternatives such as a KL penalty.
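For reviewers who would like a concrete picture of what an entropy penalty on the policy objective could look like, the following is a minimal, hypothetical sketch of a discrete SAC policy loss augmented with such a term. It is illustrative only and not necessarily the exact formulation used in the paper; the penalty form (quadratic deviation from a target entropy) and the names `beta` and `target_entropy` are assumptions introduced here for clarity.

```python
# Hypothetical sketch: discrete SAC policy loss with an added entropy penalty.
# The penalty form and hyperparameters (beta, target_entropy) are illustrative
# assumptions, not the paper's exact formulation.
import torch
import torch.nn.functional as F


def policy_loss_with_entropy_penalty(logits, q_values, alpha, beta, target_entropy):
    """Compute a discrete SAC policy loss plus a penalty that damps
    abrupt entropy shifts by pulling policy entropy toward a target.

    logits:         [batch, n_actions] policy logits
    q_values:       [batch, n_actions] critic estimates Q(s, a)
    alpha:          SAC temperature
    beta:           penalty coefficient (assumed hyperparameter)
    target_entropy: desired policy entropy (assumed hyperparameter)
    """
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)

    # Standard discrete SAC policy objective: E_{a~pi}[alpha * log pi(a|s) - Q(s, a)]
    sac_term = (probs * (alpha * log_probs - q_values)).sum(dim=-1)

    # Entropy penalty: quadratic deviation of policy entropy from a target,
    # one possible way to discourage sudden entropy collapse or explosion.
    entropy = -(probs * log_probs).sum(dim=-1)
    penalty = beta * (entropy - target_entropy).pow(2)

    return (sac_term + penalty).mean()
```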
Changes for this round of review are highlighted in orange. We believe these modifications substantially improve the persuasiveness and quality of the paper, and that it now meets the standards of TMLR.