Keywords: Explainable AI; attribution; visual grounding; multimodal
TL;DR: We propose an accelerated algorithm for visual and multimodal attribution that reduces attribution time by about 90% while retaining most of the faithfulness of the state-of-the-art.
Abstract: Attribution is essential for interpreting object-level foundation models, yet existing methods struggle with the trade-off between efficiency and faithfulness. Gradient-based approaches are efficient but imprecise, while perturbation-based approaches achieve high fidelity at prohibitive cost. Visual Precision Search (VPS) represents the current state-of-the-art, but its greedy search requires a quadratic number of forward passes, severely limiting practicality. We introduce Faster-VPS, which replaces VPS’s greedy search with a novel Phase-Window (PhaseWin) algorithm. PhaseWin combines phased pruning, windowed fine-grained selection, and adaptive control mechanisms to approximate greedy attribution with near-linear complexity. Theoretically, Faster-VPS retains approximation guarantees under monotone submodularity conditions. Empirically, it achieves over 95% of VPS’s faithfulness using only 20% of the computational budget, and consistently outperforms all other attribution baselines on tasks such as object detection and visual grounding with Grounding DINO and Florence-2. Faster-VPS thus establishes a new state-of-the-art in efficient and faithful attribution.
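To make the coarse-to-fine idea concrete, the sketch below shows one plausible reading of a phased-pruning plus windowed greedy scheme for subset selection. All names (`phasewin_select`, `score_fn`, `prune_ratio`, `window`) are illustrative assumptions, not the paper's actual implementation; the real PhaseWin additionally includes adaptive control mechanisms not modeled here.

```python
# Hypothetical sketch of phased pruning + windowed greedy selection,
# inspired by the PhaseWin description above. Names and parameters are
# illustrative assumptions, not the authors' implementation.

def phasewin_select(candidates, score_fn, k, prune_ratio=0.5, window=4):
    """Approximate greedy subset selection with far fewer oracle calls.

    candidates  : list of region identifiers
    score_fn    : marginal-gain oracle, score_fn(selected, candidate) -> float
    k           : number of regions to attribute
    prune_ratio : fraction of the pool kept after each pruning phase
    window      : size of the final pool on which exact greedy runs
    """
    pool = list(candidates)
    selected = []
    # Phased pruning: score each candidate once against the current
    # selection and keep only the top fraction, shrinking the pool
    # geometrically instead of re-scoring everything at every greedy step.
    while len(pool) > max(k, window):
        gains = {c: score_fn(selected, c) for c in pool}
        pool.sort(key=lambda c: gains[c], reverse=True)
        pool = pool[: max(k, window, int(len(pool) * prune_ratio))]
    # Windowed fine-grained selection: run exact greedy only inside the
    # small surviving window, re-scoring marginal gains after each pick.
    while len(selected) < k and pool:
        best = max(pool, key=lambda c: score_fn(selected, c))
        pool.remove(best)
        selected.append(best)
    return selected
```

Under a monotone submodular gain function, exact greedy already enjoys a (1 - 1/e) guarantee; the pruning phases trade a small approximation slack for a near-linear, rather than quadratic, number of oracle (forward-pass) evaluations.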
Primary Area: interpretability and explainable AI
Submission Number: 3472