Can Adversarial Examples be Parsed to Reveal Victim Model Information?

Published: 01 Jan 2025, Last Modified: 13 Jul 2025 · WACV 2025 · CC BY-SA 4.0
Abstract: Numerous adversarial attack methods have been developed to generate imperceptible image perturbations that cause erroneous predictions in state-of-the-art machine learning (ML) models, particularly deep neural networks (DNNs). Despite extensive research on adversarial examples, limited efforts have been made to explore the hidden characteristics carried by these perturbations. In this study, we investigate the feasibility of deducing information about the victim model (VM)—specifically, characteristics such as architecture type, kernel size, activation function, and weight sparsity—from adversarial examples. We approach this problem as a supervised learning task, where we aim to attribute categories of VM characteristics to individual adversarial examples. To facilitate this, we have assembled a dataset of adversarial attacks spanning seven types, generated from 135 victim models systematically varied across five architecture types, three kernel size configurations, three activation functions, and three levels of weight sparsity. We demonstrate that a supervised model parsing network (MPN) can effectively extract concealed details of the VM from adversarial examples. We also validate the practicality of this approach by evaluating the effects of various factors on parsing performance, such as different input formats and generalization to out-of-distribution cases. Furthermore, we highlight the connection between model parsing and attack transferability by showing how the MPN can uncover VM attributes in transfer attacks.
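To make the parsing setup concrete, below is a minimal sketch, not the authors' implementation, of how a supervised model parsing network (MPN) with one classification head per victim-model attribute could be structured in PyTorch. The class name, backbone layers, input size, and training step are illustrative assumptions; only the head output sizes follow the attribute counts stated in the abstract (five architecture types, three kernel-size configurations, three activation functions, three sparsity levels).

```python
# Hypothetical sketch of a multi-head model parsing network (MPN).
# Input: an adversarial perturbation (or adversarial example); outputs:
# per-attribute logits for the victim-model characteristics.
import torch
import torch.nn as nn


class ModelParsingNetwork(nn.Module):
    def __init__(self, in_channels: int = 3):
        super().__init__()
        # Shared feature extractor over the perturbation/image input.
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # One classification head per victim-model attribute
        # (head sizes taken from the abstract's attribute counts).
        self.heads = nn.ModuleDict({
            "architecture": nn.Linear(64, 5),
            "kernel_size": nn.Linear(64, 3),
            "activation": nn.Linear(64, 3),
            "sparsity": nn.Linear(64, 3),
        })

    def forward(self, x):
        feats = self.backbone(x)
        return {name: head(feats) for name, head in self.heads.items()}


if __name__ == "__main__":
    # Toy training step: sum of cross-entropy losses, one per attribute.
    mpn = ModelParsingNetwork()
    perturbations = torch.randn(8, 3, 32, 32)  # e.g., x_adv - x (assumed input format)
    labels = {k: torch.randint(0, n, (8,))
              for k, n in [("architecture", 5), ("kernel_size", 3),
                           ("activation", 3), ("sparsity", 3)]}
    logits = mpn(perturbations)
    loss = sum(nn.functional.cross_entropy(logits[k], labels[k]) for k in logits)
    loss.backward()
```

In such a setup, each adversarial example generated by one of the attack methods would be paired with the ground-truth attribute labels of the victim model that produced it, and the MPN would be trained as an ordinary supervised classifier over those labels.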