An Evolutionary Algorithm for Black-Box Adversarial Attack Against Explainable Methods

TMLR Paper5281 Authors

03 Jul 2025 (modified: 29 Oct 2025) · Decision pending for TMLR · CC BY 4.0
Abstract: The explainability of deep neural networks (DNNs) remains a major challenge in developing trustworthy AI, particularly in high-stakes domains such as medical imaging. Although explainable AI (XAI) techniques have advanced, they remain vulnerable to adversarial perturbations, underscoring the need for more robust evaluation frameworks. Existing adversarial attacks often focus on specific explanation strategies, while recent research has introduced black-box attacks capable of targeting multiple XAI methods. However, these approaches typically craft pixel-level perturbations that require a large number of queries and struggle to effectively attack less granular XAI methods such as Grad-CAM and LIME. To overcome these limitations, we propose a novel attack that generates perturbations using semi-transparent, RGB-valued circles optimized via an evolutionary strategy. This design reduces the number of tunable parameters, improves attack efficiency, and is adaptable to XAI methods with varying levels of granularity. Extensive experiments on medical and natural image datasets demonstrate that our method outperforms state-of-the-art techniques, exposing critical vulnerabilities in current XAI systems and highlighting the need for more robust interpretability frameworks.
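To make the high-level description in the abstract concrete, the following is a minimal, hypothetical sketch (not the authors' released implementation) of how semi-transparent RGB circles might be rendered onto an image and optimized with a simple (1+λ) evolution strategy against a black-box objective. The image size, number of circles, gene layout, and the placeholder fitness function are all assumptions made for illustration.

```python
# Hypothetical sketch of a circle-based black-box perturbation evolved with a
# simple (1+lambda) evolution strategy. Not the paper's actual implementation.
import numpy as np

H, W = 224, 224           # assumed input resolution
N_CIRCLES = 10            # assumed number of circles per perturbation
GENE_LEN = 7              # per circle: (cx, cy, radius, r, g, b, alpha) in [0, 1]

def render(image, genome):
    """Alpha-blend the encoded semi-transparent circles onto a copy of the image."""
    out = image.copy()
    ys, xs = np.mgrid[0:H, 0:W]
    for cx, cy, rad, r, g, b, a in genome.reshape(N_CIRCLES, GENE_LEN):
        mask = (xs - cx * W) ** 2 + (ys - cy * H) ** 2 <= (rad * 0.2 * W) ** 2
        out[mask] = (1 - a) * out[mask] + a * np.array([r, g, b])
    return np.clip(out, 0.0, 1.0)

def fitness(adv_image):
    """Placeholder objective: the real attack would query the black-box model
    and XAI method here and score the perturbed explanation (e.g. via PCC)."""
    return -float(np.abs(adv_image).sum())

def evolve(image, generations=100, lam=8, sigma=0.1, seed=0):
    """Simple (1+lambda) evolution strategy over the circle parameters."""
    rng = np.random.default_rng(seed)
    parent = rng.random(N_CIRCLES * GENE_LEN)      # random initial circles
    best_fit = fitness(render(image, parent))
    for _ in range(generations):
        offspring = np.clip(parent + sigma * rng.standard_normal((lam, parent.size)), 0.0, 1.0)
        fits = [fitness(render(image, child)) for child in offspring]
        i = int(np.argmax(fits))
        if fits[i] > best_fit:                     # keep the best child if it improves
            parent, best_fit = offspring[i], fits[i]
    return render(image, parent)

adv = evolve(np.zeros((H, W, 3)))                  # toy run on a blank image
```

Encoding each circle with only seven parameters keeps the search space small, which is the efficiency argument the abstract makes about reducing the number of tunable parameters relative to pixel-level perturbations.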
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We thank all reviewers for their constructive feedback, which has helped us improve the quality and clarity of our work. Several of the suggested experiments overlapped across reviews.

**\# ImageNet**: As requested, we evaluated our approach on ImageNet to further demonstrate its generalizability. Results are shown below.

**Table:** Pearson Correlation Coefficient (PCC) along with the percentage of images that satisfy the respective task constraint when attacking images from the ImageNet dataset. We provide the mean and variance of each metric over 10 runs.

| Attack | Task 1: Constraint Satisfied (↑) | Task 1: PCC (↑) | Task 2: Constraint Satisfied (↑) | Task 2: PCC (↓) |
|---------------|---------------------------------|----------------|---------------------------------|----------------|
| Square-Attack | **81.84% (14.25)** | 0.50 (0.052)‡ | 84.85% (49.49) | 0.20 (0.001)‡ |
| EvoAttack | 81.10% (13.74) | **0.75 (0.030)†** | **85.92% (51.41)** | **-0.33 (0.132)†** |
| NES | 24.23% (0.274)‡ | 0.27 (0.009)‡ | 78.16% (52.89)‡ | 0.44 (0.013)‡ |
| SAFARI | 19.05% (22.85)‡ | 0.48 (0.052)‡ | 62.80% (74.85)‡ | 0.67 (0.017)‡ |

**\# Additional Black-box Attack**: We also conducted further comparisons against the state-of-the-art Square Attack (Andriushchenko et al., ECCV 2020), confirming the superior performance of our method (table below).

**Table:** Pearson Correlation Coefficient (PCC) along with the percentage of images that satisfy the respective task constraint when attacking images from the HAM10000, Br35h, COVID-QU-Ex and ImageNet datasets. We provide the mean and variance of each metric over 10 runs.

| Method | Task 1: Constraint Satisfied (↑) | Task 1: PCC (↑) | Task 2: Constraint Satisfied (↑) | Task 2: PCC (↓) |
|---------------|---------------------------------|----------------|---------------------------------|----------------|
| Square-Attack | **83.43% (2.379)** | 0.454 (0.141)‡ | 81.65% (2.545) | 0.283 (0.243)‡ |
| EvoAttack | 82.23% (1.785) | **0.770 (0.077)†** | **82.80% (1.703)** | **-0.224 (0.261)†** |

**\# Typos and Additional Discussion**: We appreciate the reviewers pointing out the writing errors and missing clarifications. These have been corrected in the revised manuscript, with amended and additional text highlighted in blue.

**\# Varying Budgets**: In response to reviewer suggestions, we ran additional experiments across all four datasets under varying query budgets. The results are reported in Section A.8 (Query Efficiency) of the Appendix.

**\# Defence Comparisons**: We compared our method against two defence strategies, Random Noise Defense (Qin et al., NeurIPS 2021) and Randomized Smoothing (Cohen et al., ICML 2019), in addition to adversarial training. Results are shown below.

**Table:**

| Defence | Task 1: Constraint Satisfied (↑) | Task 1: PCC (↑) | Task 2: Constraint Satisfied (↑) | Task 2: PCC (↓) |
|----------------------|--------------------------------|-------------------|--------------------------------|--------------------|
| Adversarial Training | 83.33% (3.17)‡ | **0.62 (0.005)†** | **71.45% (14.14)†** | **0.10 (0.025)†** |
| Random Noise | 78.32% (1.92) | 0.87 (0.089)‡ | 93.62% (8.20) | -0.03 (0.301)‡ |
| Randomized Smoothing | **77.93% (1.27)** | 0.83 (0.010)‡ | 91.72% (10.19) | 0.0 (0.092)‡ |
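For readers unfamiliar with the PCC metric reported in the tables above, the values are Pearson correlations between explanation heatmaps, presumably comparing the explanation of the clean image with that of the adversarial image. The snippet below is a minimal sketch of that computation with random stand-in heatmaps; it is not the authors' evaluation code.

```python
# Minimal sketch of how the PCC reported above could be computed: the Pearson
# correlation between two explanation heatmaps, flattened to vectors. The maps
# below are random stand-ins, not real Grad-CAM/LIME outputs.
import numpy as np

def pcc(expl_a: np.ndarray, expl_b: np.ndarray) -> float:
    """Pearson Correlation Coefficient between two explanation maps."""
    return float(np.corrcoef(expl_a.ravel(), expl_b.ravel())[0, 1])

rng = np.random.default_rng(0)
clean_map = rng.random((224, 224))   # stand-in for the clean explanation
adv_map = rng.random((224, 224))     # stand-in for the adversarial explanation
print(f"PCC = {pcc(clean_map, adv_map):.3f}")  # close to 0 for unrelated maps
```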
Assigned Action Editor: ~Yingzhen_Li1
Submission Number: 5281