M-Attack-V2: Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting

21 Apr 2025 (modified: 29 Oct 2025) · Submitted to NeurIPS 2025 · CC BY 4.0
Keywords: Large Vision-Language Model, Adversarial Attack, Multi-modal Models
Abstract: Black-box adversarial attacks on Large Vision–Language Models (LVLMs) present unique challenges due to the absence of gradient access and complex multimodal decision boundaries. While the prior \texttt{M-Attack} achieved notable success, exceeding a 90\% attack success rate on GPT‑4o/o1/4.5 by leveraging local crop-level matching between source and target data, we show that this strategy introduces high-variance gradient estimates. Specifically, we empirically find that gradients computed over randomly sampled local crops are nearly orthogonal, violating the implicit assumption of coherent local alignment and leading to unstable optimization. To address this, we propose a theoretically grounded {\bf \em gradient denoising} framework that redefines the adversarial objective as an expectation over local transformations. Our first component, \emph{Multi-Crop Alignment (MCA)}, estimates the expected gradient by averaging gradients across diverse, independently sampled local transformations, significantly reducing gradient variance and thus enhancing convergence stability. Recognizing an asymmetry in the roles of source and target transformations, we also introduce \emph{Auxiliary Target Alignment (ATA)}. ATA regularizes the optimization by aligning the adversarial example not only with the primary target image but also with auxiliary samples drawn from a semantically correlated distribution; this constructs a smooth semantic trajectory in the embedding space and acts as a low-variance regularizer over the target distribution. Finally, we reinterpret momentum replay, through the lens of local matching, as a variance-minimizing estimator under the crop-transformed objective landscape: it stabilizes and amplifies transferable perturbations by maintaining gradient directionality across local perturbation manifolds. Together, MCA, ATA, momentum replay, and a carefully selected ensemble set constitute \texttt{M-Attack-V2}, a principled framework for robust black-box LVLM attacks. Empirical results show that our framework improves the attack success rate on GPT‑4o from {\bf 95\%$\rightarrow$99\%} and on Gemini-2.5-Pro from {\bf 83\%$\rightarrow$97\%}, with consistent gains on Claude-3.7, significantly surpassing all existing black-box LVLM attack methods.
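For intuition, below is a minimal sketch of how the MCA gradient-denoising step described in the abstract could be realized against a differentiable surrogate encoder, i.e., a Monte-Carlo estimate of the expectation objective $\min_\delta \mathbb{E}_{t\sim\mathcal{T}}\big[\mathcal{L}\big(f(t(x+\delta)),\, f(t(x_{\text{tgt}}))\big)\big]$. The encoder, function names, and hyperparameters here are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of a Multi-Crop Alignment (MCA) step with momentum
# replay, assuming a differentiable surrogate image encoder `encoder`
# (e.g., a CLIP-style vision tower) mapping images to embeddings.
# All names and hyperparameters are hypothetical, not the paper's code.
import torch
import torch.nn.functional as F
import torchvision.transforms as T

def mca_step(encoder, x_src, x_tgt, delta, momentum,
             n_crops=8, step_size=1/255, eps=16/255, mu=1.0):
    """One PGD-style step whose gradient is averaged over random crops."""
    crop = T.RandomResizedCrop(size=tuple(x_src.shape[-2:]), scale=(0.5, 1.0))
    grad_est = torch.zeros_like(delta)
    for _ in range(n_crops):
        d = delta.detach().requires_grad_(True)
        # Independently sampled crops of the adversarial and target images.
        adv_view = crop(torch.clamp(x_src + d, 0, 1))
        tgt_view = crop(x_tgt)
        z_adv = F.normalize(encoder(adv_view), dim=-1)
        z_tgt = F.normalize(encoder(tgt_view), dim=-1)
        loss = -(z_adv * z_tgt).sum()        # maximize cosine similarity
        loss.backward()
        grad_est += d.grad / n_crops         # Monte-Carlo expected gradient
    # Momentum replay: preserve gradient directionality across iterations.
    momentum = mu * momentum + grad_est / grad_est.abs().mean().clamp_min(1e-12)
    # Descent step, then project back onto the L-infinity ball.
    delta = (delta - step_size * momentum.sign()).clamp(-eps, eps)
    return delta, momentum
```

In use, `momentum` would start as `torch.zeros_like(delta)` and the step would be iterated over a surrogate ensemble; in \texttt{M-Attack-V2} the averaged gradient would additionally include ATA terms against auxiliary target embeddings, omitted here for brevity.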
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 3398