Improving Adversarial Training for Two-player Competitive Games via Episodic Reward Engineering

TMLR Paper5580 Authors

08 Aug 2025 (modified: 14 Aug 2025) · Under review for TMLR · CC BY 4.0
Abstract: Training adversarial agents to attack neural network policies has proven to be both effective and practical. However, we observe that existing methods can be further improved by distinguishing between states that lead to a win and states that lead to a loss, and by using reward engineering to steer policy training toward winning states. In this paper, we introduce a novel adversarial training method with reward engineering for two-player competitive games. Our method extracts evaluations of states from historical experiences using an episodic memory, and then incorporates these evaluations into the rewards through our proposed reward revision method to improve adversarial policy optimization. We evaluate our approach on two-player competitive games in MuJoCo simulation environments, demonstrating that, among existing adversarial policy training techniques, our method achieves the strongest attack performance and is the most difficult for the victims to defend against.
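As a rough illustration of the mechanism described in the abstract, the following minimal Python sketch shows one way an episodic memory of win/loss outcomes could be turned into a reward-revision term. The class and function names, the discretized state key, and the `beta` weight are hypothetical illustrations, not the paper's implementation.

```python
from collections import defaultdict


class EpisodicMemory:
    """Tracks, per discretized state key, how often episodes containing that
    state ended in a win. The resulting win rate serves as an empirical,
    cross-episode evaluation of the state (hypothetical simplification)."""

    def __init__(self):
        self.wins = defaultdict(float)
        self.visits = defaultdict(float)

    def update(self, episode_state_keys, won):
        # After each episode, credit every visited state with the episode outcome.
        for key in episode_state_keys:
            self.visits[key] += 1.0
            self.wins[key] += 1.0 if won else 0.0

    def evaluate(self, key):
        # Historical evaluation in [0, 1]; 0.5 for states never seen before.
        if self.visits[key] == 0.0:
            return 0.5
        return self.wins[key] / self.visits[key]


def revise_reward(env_reward, memory, state_key, beta=0.1):
    """Reward revision: add a shaping term that prioritizes states whose
    historical evaluation indicates a likely win (beta is an assumed weight)."""
    return env_reward + beta * (memory.evaluate(state_key) - 0.5)
```

In this sketch the shaping term is centered at 0.5 so that unseen or neutral states leave the environment reward unchanged; the paper's actual revision rule may differ.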
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=hbQ5mDi64p
Changes Since Last Submission: In the last submission, reviewers were mainly concerned about a clearer statement of the experimental goals, a brief introduction to the environments, and the theoretical analysis. Therefore, in this manuscript we add:
- A list of the goals of the experiments.
- A brief introduction to our experimental environments.
- A theoretical proof that reward shaping using episodic feedback preserves policy optimality.
- Fixes for the formatting issues mentioned by Reviewer 98ds.
- The parameter selection, moved from the appendix to the main text.

-----

In particular, regarding the final decision from Action Editor ED9N on our last submission, we make the following clarifications:
- **Theoretical background.** Action Editor ED9N noted a lack of theoretical background and referenced [1]. After reviewing [1], we found that the theoretical analysis sketched in our rebuttal to Reviewer 6utp matches the background the Action Editor requested. Accordingly, we now include a more comprehensive treatment: a full proof of policy optimality in the appendix of the revised manuscript.
- **Episodic memory vs. critic network.** Action Editor ED9N suggested that our episodic memory is related to the critic network in actor–critic methods. We acknowledge that both produce scalar evaluations of states (and actions), which is common practice. However, our memory has completely different objectives and uses from a critic: it computes an empirical, cross-episode statistic that distinguishes "winning" from "losing" state patterns, and this signal is injected into the reward purely as a shaping term. By contrast, a critic is a bootstrapped value estimator (Q or V) used inside the policy gradient as an advantage. Since the two approximate different quantities, are trained with different signals, and serve different downstream purposes (reward shaping vs. variance reduction), the revised manuscript does not add a discussion of critic networks.

[1] Andrew Y. Ng, Daishi Harada, and Stuart J. Russell. Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping. ICML 1999.
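For context on the optimality-preservation claim above, the classic potential-based shaping result of [1] can be sketched as follows; treating the episodic-memory evaluation as the potential function is an assumption of this sketch, not necessarily the paper's exact construction.

```latex
% Potential-based reward shaping (Ng, Harada & Russell, 1999).
% For any potential function \Phi : S \to \mathbb{R} and discount \gamma,
% shape the reward with F(s, s') = \gamma \Phi(s') - \Phi(s):
R'(s, a, s') \;=\; R(s, a, s') \;+\; \gamma\,\Phi(s') - \Phi(s).
% Theorem (Ng et al., 1999): the shaped MDP (S, A, P, R', \gamma) and the
% original MDP (S, A, P, R, \gamma) share the same optimal policies.
% In this sketch, \Phi(s) is assumed to be the episodic-memory evaluation of s.
```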
Assigned Action Editor: ~Tongzheng_Ren1
Submission Number: 5580