Keywords: Explainable AI; attribution; visual grounding; multimodal
TL;DR: We propose an accelerated algorithm for visual and multimodal attribution that reduces attribution time by about 90% while retaining most of the faithfulness of the state-of-the-art.
Abstract: Attribution is essential for interpreting object-level foundation models, yet existing methods struggle with the trade-off between efficiency and faithfulness. Gradient-based approaches are efficient but imprecise, while perturbation-based approaches achieve high fidelity at prohibitive cost. Visual Precision Search (VPS) represents the current state-of-the-art, but its greedy search requires a quadratic number of forward passes, severely limiting practicality. We introduce Faster-VPS, which replaces VPS’s greedy search with a novel Phase-Window (PhaseWin) algorithm. PhaseWin combines phased pruning, windowed fine-grained selection, and adaptive control mechanisms to approximate greedy attribution with near-linear complexity. Theoretically, Faster-VPS retains approximation guarantees under monotone submodularity conditions. Empirically, it achieves over 95% of VPS’s faithfulness using only 20% of the computational budget, and consistently outperforms all other attribution baselines on tasks such as object detection and visual grounding with Grounding DINO and Florence-2. Faster-VPS thus establishes a new state-of-the-art in efficient and faithful attribution.
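To make the coarse-to-fine idea concrete, the sketch below shows one plausible reading of a phased-pruning plus windowed greedy scheme for subset selection. All names (`phasewin_select`, `score_fn`, `prune_ratio`, `window`) are illustrative assumptions, not the paper's actual implementation; the real PhaseWin additionally includes adaptive control mechanisms not modeled here.

```python
# Hypothetical sketch of phased pruning + windowed greedy selection,
# inspired by the PhaseWin description above. Names and parameters are
# illustrative assumptions, not the authors' implementation.

def phasewin_select(candidates, score_fn, k, prune_ratio=0.5, window=4):
    """Approximate greedy subset selection with far fewer oracle calls.

    candidates  : list of region identifiers
    score_fn    : marginal-gain oracle, score_fn(selected, candidate) -> float
    k           : number of regions to attribute
    prune_ratio : fraction of the pool kept after each pruning phase
    window      : size of the final pool on which exact greedy runs
    """
    pool = list(candidates)
    selected = []
    # Phased pruning: score each candidate once against the current
    # selection and keep only the top fraction, shrinking the pool
    # geometrically instead of re-scoring everything at every greedy step.
    while len(pool) > max(k, window):
        gains = {c: score_fn(selected, c) for c in pool}
        pool.sort(key=lambda c: gains[c], reverse=True)
        pool = pool[: max(k, window, int(len(pool) * prune_ratio))]
    # Windowed fine-grained selection: run exact greedy only inside the
    # small surviving window, re-scoring marginal gains after each pick.
    while len(selected) < k and pool:
        best = max(pool, key=lambda c: score_fn(selected, c))
        pool.remove(best)
        selected.append(best)
    return selected
```

Under a monotone submodular gain function, exact greedy already enjoys a (1 - 1/e) guarantee; the pruning phases trade a small approximation slack for a near-linear, rather than quadratic, number of oracle (forward-pass) evaluations.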
Primary Area: interpretability and explainable AI
Submission Number: 3472