Can Exploration Save Us from Adversarial Attacks? A Reinforcement Learning Approach to Adversarial Robustness

ICLR 2026 Conference Submission 21468 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: adversarial robustness, gradient-based attacks, reinforcement learning, exploration, image classification, transfer attacks, robustness analysis
Abstract: Although considerable progress has been made toward enhancing the robustness of deep neural networks (DNNs), they continue to exhibit significant vulnerability to gradient-based adversarial attacks in supervised learning (SL) settings. We investigate adversarial robustness under reinforcement learning (RL), training image classifiers with policy-gradient objectives and $\epsilon$-greedy exploration. Across several architectures trained on CIFAR-10, CIFAR-100, and ImageNet-100, RL consistently improves adversarial accuracy under white-box gradient-based attacks. On a representative 6-layer CNN, adversarial accuracy increases from approximately 5\% to 55\% on CIFAR-10, 2\% to 25\% on CIFAR-100, and 5\% to 18\% on ImageNet-100, while clean accuracy decreases by only 3–5\% relative to SL. However, transfer analysis reveals that adversarial examples crafted on RL models transfer poorly: both SL and RL models retain approximately 43\% accuracy against these attacks. In contrast, adversarial examples crafted on SL models transfer effectively, reducing both SL and plain RL models to around 8\% accuracy. This indicates that while plain RL hinders the generation of strong adversarial examples, it remains vulnerable to attacks transferred from other models, and therefore requires adversarial training (RL-adv, $\sim$30\% adversarial accuracy) for a comprehensive defense against cross-model attacks. Analysis of loss geometry and gradient dynamics shows that RL induces smaller gradient norms and rapidly changing input-gradient directions, reducing the information exploitable by gradient-based attackers. Despite higher computational overhead, these findings suggest RL-based training can complement existing defenses by naturally smoothing loss landscapes, motivating hybrid approaches that combine SL efficiency with RL-induced gradient regularization.
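
The abstract's training recipe (policy-gradient objectives with $\epsilon$-greedy exploration over class labels) can be illustrated with a minimal sketch. The snippet below is a hypothetical REINFORCE-style update, not the authors' implementation: the reward scheme (+1 for a correct prediction, -1 otherwise), the `policy_gradient_step` helper, and the default `epsilon=0.1` are assumptions made for illustration, since the abstract does not specify them.

```python
import torch
import torch.nn.functional as F


def policy_gradient_step(model, optimizer, images, labels, epsilon=0.1):
    """One REINFORCE-style update treating class prediction as an action.

    Hypothetical sketch: assumes a +1/-1 reward for correct/incorrect
    predictions and epsilon-greedy exploration over class labels.
    """
    logits = model(images)                       # (batch, num_classes)
    probs = F.softmax(logits, dim=-1)

    # epsilon-greedy action selection: mostly greedy, occasionally a random class
    greedy = probs.argmax(dim=-1)
    random_actions = torch.randint(0, probs.size(-1), greedy.shape, device=greedy.device)
    explore = torch.rand(greedy.shape, device=greedy.device) < epsilon
    actions = torch.where(explore, random_actions, greedy)

    # reward: +1 if the chosen class is correct, -1 otherwise
    rewards = (actions == labels).float() * 2.0 - 1.0

    # REINFORCE loss: -E[ reward * log pi(action | image) ]
    log_probs = F.log_softmax(logits, dim=-1)
    chosen_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -(rewards * chosen_log_probs).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Unlike a cross-entropy update, the gradient here flows only through the log-probability of the sampled action and is scaled by a non-differentiable reward, which is one plausible source of the smaller and more rapidly changing input gradients the abstract reports.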
Primary Area: reinforcement learning
Submission Number: 21468