$\texttt{M-Attack-V2}$: Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting
Keywords: Large Vision-Language Model, Adversarial Attack, Multi-modality Models
Abstract: Black-box adversarial attacks on Large Vision–Language Models (LVLMs) present unique challenges due to the absence of gradient access and complex multimodal decision boundaries. While the prior $\texttt{M-Attack}$ achieved notable success, exceeding a 90% attack success rate on GPT‑4o/o1/4.5 by leveraging local crop-level matching between source and target images, we show that this strategy introduces high-variance gradient estimates. Specifically, we empirically find that gradients computed over independently sampled local crops are nearly orthogonal, violating the implicit assumption of coherent local alignment and leading to unstable optimization.
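In symbols (a sketch with notation introduced here for illustration, not taken from the submission; $T$ denotes a random crop-and-resize transformation, $f$ a surrogate encoder, $x_s$/$x_t$ the source/target images, and $\delta$ the perturbation):

$$g_T = \nabla_{\delta}\,\mathcal{L}\big(f(T(x_s+\delta)),\, f(T(x_t))\big), \qquad \mathbb{E}\big[\cos\angle(g_{T_1}, g_{T_2})\big] \approx 0 \;\; \text{for independent } T_1, T_2,$$

so any single-crop gradient is a high-variance estimate of the expected gradient $\mathbb{E}_T[g_T]$ that the attack actually needs.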
To address this, we propose a theoretically grounded **gradient denoising** framework that redefines the adversarial objective as an expectation over local transformations. Our first component, *Multi-Crop Alignment (MCA)*, estimates the expected gradient by averaging gradients across diverse, independently sampled local transformations. This significantly reduces gradient variance, enhancing convergence stability. Recognizing an asymmetry in the roles of source and target transformations, we introduce *Auxiliary Target Alignment (ATA)*. ATA regularizes optimization by aligning the adversarial example not only with the primary target image but also with auxiliary samples from a semantically correlated distribution. This forms a smooth semantic trajectory in the embedding space, acting as a low-variance regularizer.
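A minimal PyTorch sketch of how MCA and ATA could be combined; all names here (`encoder`, `crop`, `aux_tgts`, `ata_weight`) are hypothetical placeholders, not the authors' released interface:

```python
import torch

def mca_ata_grad(delta, x_src, x_tgt, aux_tgts, encoder, crop,
                 n_crops=8, ata_weight=0.5):
    # Illustrative sketch: `encoder` is a surrogate image encoder,
    # `crop` a random crop-and-resize transform sampled anew per call.
    delta = delta.detach().requires_grad_(True)
    total = 0.0
    for _ in range(n_crops):
        adv = encoder(crop(x_src + delta))        # MCA: fresh crop each draw
        tgt = encoder(crop(x_tgt)).detach()       # primary target embedding
        total = total - torch.cosine_similarity(adv, tgt, dim=-1).mean()
        for x_aux in aux_tgts:                    # ATA: semantically correlated targets
            aux = encoder(crop(x_aux)).detach()
            total = total - (ata_weight / len(aux_tgts)) * \
                torch.cosine_similarity(adv, aux, dim=-1).mean()
    (total / n_crops).backward()                  # mean over crops ≈ E_T[g_T]
    return delta.grad
```

Averaging the loss over `n_crops` independent crops before the backward pass is equivalent to averaging the per-crop gradients, which is the variance-reduction step MCA describes; the auxiliary terms act as the low-variance regularizer ATA introduces.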
Finally, we reinterpret the momentum replay used in prior local-matching attacks as a variance-minimizing estimator over the crop-transformed objective landscape: it stabilizes and amplifies transferable perturbations by preserving gradient directionality across local perturbation manifolds. Together, MCA, ATA, momentum replay, and a carefully selected ensemble constitute $\texttt{M-Attack-V2}$, a principled framework for robust black-box LVLM attacks.
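Under this variance-minimization reading, momentum replay can be sketched as an MI-FGSM-style accumulation of the denoised gradient (again a hypothetical sketch; the submission's exact replay schedule and hyperparameters may differ):

```python
def momentum_replay_step(delta, grad, m, mu=1.0, alpha=2/255, eps=16/255):
    # Accumulate an L1-normalized gradient into the running direction `m`,
    # then take a sign step projected into the epsilon-ball. mu, alpha, eps
    # are illustrative placeholder values.
    m = mu * m + grad / grad.abs().mean().clamp_min(1e-12)
    delta = (delta + alpha * m.sign()).clamp(-eps, eps)
    return delta.detach(), m
```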
Empirical results show that our framework significantly improves the attack success rate on Claude-4.0 from **8% → 30%**, on Gemini-2.5-Pro from **83% → 97%**, and on GPT‑5 from **98% → 100%**, surpassing all existing black-box LVLM attack methods.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 12825