Keywords: Adversarial detection, prime quantization, Gromov-Wasserstein geometry, theoretical guarantees, robust ML
TL;DR: Our work introduces prime quantization and cross-space geometry as a principled foundation for adversarial detection with formal guarantees and broad empirical validation.
Abstract: Adversarial vulnerability persists across modern vision architectures, from CNNs to vision-language models (VLMs), yet existing detection methods rely on heuristics without theoretical guarantees. We address the fundamental question of when adversarial perturbations can be provably detected, approaching it from a geometric perspective.
Our key insight is that adversarial perturbations cannot simultaneously preserve geometric structure across spaces with fundamentally different properties. Accordingly, we construct two such complementary metric spaces.
First, we use a standard CNN embedding space $Z$, in which adversarial samples exhibit significant displacement. Second, we build a novel prime-quantized space $P$ that absorbs small perturbations through number-theoretic discretization, yielding minimal displacement while preserving discriminability. We then leverage the geometric discrepancies between $Z$ and $P$ to detect adversarial samples.
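To make the absorption idea concrete, here is a minimal illustrative sketch of one way a prime-quantized space could be built: snap each discrete pixel value to its nearest prime (preferring the smaller prime on ties). The abstract does not specify the actual construction, so the mapping below is an assumption for illustration only; the point it demonstrates is that small perturbations are absorbed into the same quantized value.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test (sufficient for pixel-range values)."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def nearest_prime(n: int) -> int:
    """Map n to the closest prime, preferring the smaller prime on ties.
    NOTE: a hypothetical quantizer, not the paper's actual construction."""
    if n < 2:
        return 2
    for d in range(n + 1):  # search outward from n
        if is_prime(n - d):
            return n - d
        if is_prime(n + d):
            return n + d

def prime_quantize(pixels):
    """Quantize a sequence of pixel intensities into the prime grid."""
    return [nearest_prime(v) for v in pixels]

# A +/-1 perturbation of intensity 101 is absorbed: all map to 101.
print(prime_quantize([100, 101, 102]))
```

Because consecutive prime gaps are small over the 0–255 range, clean discriminability is largely preserved, while sub-gap perturbations collapse to the same codeword.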
To the best of our knowledge, we establish the first rigorous separation theory for adversarial detection, proving that adversarial samples create unavoidable geometric inconsistencies across the two spaces. Our framework provides theoretical guarantees, including pixel-level absorption bounds, neighborhood diameter concentration, Gromov-Wasserstein (GW) separation theorems, and practical risk control.
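The cross-space inconsistency can be sketched as follows. The snippet below scores a sample by comparing the normalized pairwise-distance matrices of its neighborhood in $Z$ and in $P$; this is a simplified, index-aligned stand-in for the GW discrepancy the paper invokes (computing true GW distance requires optimizing over correspondences, e.g. via an optimal-transport solver). The neighborhood construction and the averaged-absolute-difference score are assumptions for illustration.

```python
import numpy as np

def pairwise_dist(X: np.ndarray) -> np.ndarray:
    """Euclidean distance matrix for points stored as rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def inconsistency_score(z_neigh: np.ndarray, p_neigh: np.ndarray) -> float:
    """Crude cross-space geometric discrepancy between a sample's
    neighborhood embedded in Z and in P. Assumes the two neighborhoods
    are index-aligned; a simplified proxy for the GW discrepancy."""
    Dz = pairwise_dist(z_neigh)
    Dp = pairwise_dist(p_neigh)
    Dz = Dz / (Dz.max() + 1e-12)  # normalize out global scale
    Dp = Dp / (Dp.max() + 1e-12)
    return float(np.abs(Dz - Dp).mean())

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 4))

# A uniform rescaling preserves geometry: score is ~0.
print(inconsistency_score(z, 2.0 * z))

# Displacing one point in Z only (as an adversarial sample would)
# breaks cross-space consistency: score rises above 0.
z_adv = z.copy()
z_adv[0] += 5.0
print(inconsistency_score(z_adv, z))
```

A sample would then be flagged as adversarial when its score exceeds a calibrated threshold, which is where the paper's risk-control guarantees would apply.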
Extensive experiments validate our theoretical predictions, and our detector achieves consistently strong performance across a wide range of attack types and model families.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 17313