Abstract: Traditional adversarial attacks rely on perturbations generated using gradients from the target network. These methods typically employ gradient-guided search to construct adversarial counterparts that deceive the model. In this paper, we propose a novel mechanism for generating adversarial examples in which the original image is not directly corrupted. Instead, its latent-space representation is manipulated to alter the inherent structure of the image while preserving perceptual quality, allowing the modified samples to appear as legitimate data.
In contrast to gradient-based attacks, our latent-space poisoning approach exploits the tendency of classifiers to assume that test-time inputs are drawn independently and identically distributed (i.i.d.) from the training distribution. By producing carefully crafted out-of-distribution samples, the method deceives the classifier without introducing visible perturbations.
To achieve this, we train a disentangled variational autoencoder ($\beta$-VAE) to model the data distribution in the latent space. We then inject noise perturbations sampled from a class-conditioned distribution into the latent representation, under the constraint that the reconstructed sample is misclassified as a specified target label. Empirical results on the MNIST, SVHN, and CelebA datasets demonstrate that the generated adversarial examples can successfully fool robust classifiers designed with provable defenses under $\ell_0$, $\ell_2$, and $\ell_\infty$ threat models.
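The latent-perturbation step described above can be illustrated with a minimal NumPy sketch. This is not the paper's $\beta$-VAE pipeline: the decoder and classifier are replaced by random linear stand-ins, and all names, dimensions, the step size, and the latent-norm budget `eps` are illustrative assumptions. The sketch only shows the core idea of nudging a latent code toward a target label while keeping the perturbation inside a small ball around the original code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the paper's components: a frozen linear
# "decoder" mapping latent z -> image x, and a linear softmax
# "classifier" over x. Both are random placeholders, not trained models.
latent_dim, image_dim, n_classes = 8, 16, 3
W_dec = rng.normal(size=(image_dim, latent_dim))
W_clf = rng.normal(size=(n_classes, image_dim))
M = W_clf @ W_dec  # composed latent -> logits map

def predict(z):
    """Softmax class probabilities for the decoded latent code z."""
    a = M @ z
    e = np.exp(a - a.max())
    return e / e.sum()

z0 = rng.normal(size=latent_dim)   # original latent code
target = 2                         # attacker-chosen target label
lr, eps = 0.1, 2.0                 # step size and latent L2 budget (assumptions)

onehot = np.eye(n_classes)[target]
best_z, best_p = z0.copy(), predict(z0)[target]
z = z0.copy()
for _ in range(300):
    p = predict(z)
    # Gradient of log p[target] w.r.t. z for this linear pipeline.
    grad = M.T @ (onehot - p)
    z = z + lr * grad / (np.linalg.norm(grad) + 1e-12)
    # Project back into an L2 ball around the original latent code so
    # the reconstruction stays perceptually close to the original.
    delta = z - z0
    n = np.linalg.norm(delta)
    if n > eps:
        z = z0 + delta * (eps / n)
    pt = predict(z)[target]
    if pt > best_p:
        best_z, best_p = z.copy(), pt

print(best_p >= predict(z0)[target])  # target-class probability never decreases
```

In the paper's setting the linear maps would be the trained $\beta$-VAE decoder and the victim classifier, and the perturbation would be sampled from a class-conditioned distribution rather than found by gradient ascent; the projection step plays the role of the perceptual-quality constraint.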