Keywords: VLM Efficiency, Question-Guided Sparse Attention, Gumbel-Softmax, Self-Supervised Grounding, Inference Optimization, Spatially-Grounded VQA, FLOPs Reduction
Abstract: Visual Question Answering (VQA) models process all image patches uniformly
despite questions typically requiring only a small subset of visual information.
This inefficiency leads to unnecessary computation and can result in attention
dilution across irrelevant image regions. We propose \textbf{Question-Guided
Sparse Attention (QGSA)}, a plug-and-play mechanism that dynamically selects
relevant image patches conditioned on question semantics. Our approach introduces
three components: (1) a differentiable patch selector based on Gumbel-Softmax
reparameterisation that enables end-to-end training with hard patch selection at
inference; (2) a self-supervised grounding loss that encourages spatial
selectivity without bounding-box annotations, combining contrastive patch
selection with patch--word alignment via a frozen CLIP encoder; and (3) an
adaptive sparsity mechanism that adjusts the number of selected patches according
to estimated question complexity. Experiments on SmolVLM-256M-Instruct and
SmolVLM-500M-Instruct across three VQA benchmarks (VQA-RAD, A-OKVQA, RefCOCO)
demonstrate that QGSA reduces cross-attention FLOPs by 91--99\% across input
resolutions, achieving up to $76\times$ theoretical speedup at 576px resolution, while
maintaining \emph{exact} accuracy parity with the dense baseline ($\Delta=0.0$\,pp
on all datasets).
Wall-clock parity with the dense baseline is reached at 336px; realised
end-to-end speedup requires larger models where cross-attention dominates total
compute. QGSA consistently selects an average of $k\approx17$ patches out of
576 (256M model), and up to $k\approx18$ (500M model), yielding up to a $34\times$
reduction in visual token sequence length. These small-scale results validate the
feasibility of question-conditioned sparse attention and provide a foundation for
scaling to larger VLMs.
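The core selection mechanism described in component (1) can be illustrated with a minimal NumPy sketch: question-conditioned relevance scores over patches are perturbed with Gumbel noise, producing soft attention weights during training and a hard top-$k$ mask at inference. The function name, embedding sizes, and scoring by dot product are illustrative assumptions, not the paper's implementation; $k=17$ and the 576-patch grid follow the figures quoted in the abstract.

```python
import numpy as np

def gumbel_topk_select(patch_emb, question_emb, k=17, tau=1.0, hard=True, rng=None):
    """Illustrative question-guided patch selection (hypothetical helper).

    patch_emb: (P, d) patch embeddings; question_emb: (d,) pooled question embedding.
    Returns (mask, soft): mask is a {0,1} vector with exactly k ones (hard=True),
    soft is a softmax distribution over patches usable for training.
    """
    rng = np.random.default_rng() if rng is None else rng
    scores = patch_emb @ question_emb                                # (P,) relevance
    g = -np.log(-np.log(rng.uniform(1e-9, 1.0, size=scores.shape)))  # Gumbel(0, 1) noise
    noisy = (scores + g) / tau                                       # temperature-scaled logits
    soft = np.exp(noisy - noisy.max())
    soft /= soft.sum()                                               # soft attention over patches
    mask = np.zeros_like(soft)
    mask[np.argsort(noisy)[-k:]] = 1.0                               # hard top-k at inference
    return (mask if hard else soft), soft

# Toy usage: 576 patches (24x24 grid at 576px), 64-dim embeddings.
rng = np.random.default_rng(0)
patches = rng.standard_normal((576, 64))
question = rng.standard_normal(64)
mask, soft = gumbel_topk_select(patches, question, k=17, rng=rng)
print(int(mask.sum()))  # 17 patches kept: a 576/17 ~ 34x token reduction
```

In a differentiable framework (e.g. PyTorch), the hard mask would be combined with the soft weights via a straight-through estimator so gradients flow through `soft` while the forward pass uses `mask`; the sketch above shows only the sampling and selection logic.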
Submission Number: 42