Keywords: Adversarial detection, Geometric analysis, Multimodal models
Abstract: Vision-language pre-trained models (VLPs) are widely deployed in real-world applications, yet they remain vulnerable to adversarial attacks. Although adversarial detection methods have demonstrated success in single-modality settings (either vision or language), their effectiveness and reliability in multimodal models such as VLPs remain largely unexplored. In this work, we investigate the embedding spaces of VLPs and find that the image embedding space exhibits anisotropy. Our theoretical analysis shows that this anisotropic structure increases the separation between clean and adversarial examples (AEs) in the embedding space. Specifically, we demonstrate that AEs consistently exhibit greater expected distances to randomly sampled points than their clean counterparts, indicating that adversarial perturbations tend to push inputs off the data manifold. Building on these insights, we propose GeoDetect, which leverages these off-manifold deviations to identify AEs. Through comprehensive evaluations, we show that our approach reliably detects adversarial attacks across various VLP architectures, including but not limited to CLIP, offering a robust and practical means of improving the safety and reliability of these models.
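The sketch below illustrates the distance-based statistic described in the abstract: the expected distance from an input's image embedding to a set of randomly sampled reference embeddings, with larger values flagged as adversarial. It is a minimal illustration, not the paper's GeoDetect procedure; the CLIP checkpoint, the choice of Euclidean distance, the `reference_embs` set, and the detection threshold are all illustrative assumptions.

```python
# Hypothetical sketch of an expected-distance detection statistic on CLIP image
# embeddings. GeoDetect's actual algorithm and calibration are not specified here.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_images(images):
    """Map a list of PIL images to L2-normalized CLIP image embeddings."""
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(pixel_values=inputs["pixel_values"])
    return torch.nn.functional.normalize(feats, dim=-1)

@torch.no_grad()
def expected_distance(query_embs, reference_embs):
    """Mean Euclidean distance from each query embedding to a pool of
    randomly sampled reference embeddings (e.g. from clean data)."""
    return torch.cdist(query_embs, reference_embs).mean(dim=-1)

def flag_adversarial(query_embs, reference_embs, threshold):
    """Flag inputs whose expected distance exceeds a calibrated threshold,
    following the abstract's claim that adversarial examples lie farther
    from randomly sampled points in the anisotropic embedding space."""
    return expected_distance(query_embs, reference_embs) > threshold
```

In practice, `reference_embs` would be precomputed from clean images and the threshold calibrated on held-out clean data (for example, to a target false-positive rate); both choices are assumptions for this sketch.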
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 17780