Keywords: VLM Efficiency, Question-Guided Sparse Attention, Gumbel-Softmax, Self-Supervised Grounding, Inference Optimization, Spatially-Grounded VQA, FLOPs Reduction
Abstract: Visual Question Answering (VQA) models process all image patches uniformly
despite questions typically requiring only a small subset of visual information.
This inefficiency leads to unnecessary computation and can result in attention
dilution across irrelevant image regions. We propose \textbf{Question-Guided
Sparse Attention (QGSA)}, a plug-and-play mechanism that dynamically selects
relevant image patches conditioned on question semantics. Our approach introduces
three components: (1) a differentiable patch selector based on Gumbel-Softmax
reparameterisation that enables end-to-end training with hard patch selection at
inference; (2) a self-supervised grounding loss that encourages spatial
selectivity without bounding-box annotations, combining contrastive patch
selection with patch--word alignment via a frozen CLIP encoder; and (3) an
adaptive sparsity mechanism that adjusts the number of selected patches according
to estimated question complexity. Experiments on SmolVLM-256M-Instruct and
SmolVLM-500M-Instruct across three VQA benchmarks (VQA-RAD, A-OKVQA, RefCOCO)
demonstrate that QGSA reduces cross-attention FLOPs by 91--99\% across input
resolutions, achieving up to $76\times$ theoretical speedup at 576px resolution, while
maintaining \emph{exact} accuracy parity with the dense baseline ($\Delta=0.0$\,pp
on all datasets).
Wall-clock parity with the dense baseline is reached at 336px; realised
end-to-end speedup requires larger models where cross-attention dominates total
compute. QGSA consistently selects an average of $k\approx17$ patches out of
576 (256M model), and up to $k\approx18$ (500M model), yielding up to a $34\times$
reduction in visual token sequence length. These small-scale results validate the
feasibility of question-conditioned sparse attention and provide a foundation for
scaling to larger VLMs.
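The core selection mechanism described in component (1) can be illustrated with a minimal NumPy sketch: question-conditioned relevance scores over patches are perturbed with Gumbel noise, producing soft attention weights during training and a hard top-$k$ mask at inference. The function name, embedding sizes, and scoring by dot product are illustrative assumptions, not the paper's implementation; $k=17$ and the 576-patch grid follow the figures quoted in the abstract.

```python
import numpy as np

def gumbel_topk_select(patch_emb, question_emb, k=17, tau=1.0, hard=True, rng=None):
    """Illustrative question-guided patch selection (hypothetical helper).

    patch_emb: (P, d) patch embeddings; question_emb: (d,) pooled question embedding.
    Returns (mask, soft): mask is a {0,1} vector with exactly k ones (hard=True),
    soft is a softmax distribution over patches usable for training.
    """
    rng = np.random.default_rng() if rng is None else rng
    scores = patch_emb @ question_emb                                # (P,) relevance
    g = -np.log(-np.log(rng.uniform(1e-9, 1.0, size=scores.shape)))  # Gumbel(0, 1) noise
    noisy = (scores + g) / tau                                       # temperature-scaled logits
    soft = np.exp(noisy - noisy.max())
    soft /= soft.sum()                                               # soft attention over patches
    mask = np.zeros_like(soft)
    mask[np.argsort(noisy)[-k:]] = 1.0                               # hard top-k at inference
    return (mask if hard else soft), soft

# Toy usage: 576 patches (24x24 grid at 576px), 64-dim embeddings.
rng = np.random.default_rng(0)
patches = rng.standard_normal((576, 64))
question = rng.standard_normal(64)
mask, soft = gumbel_topk_select(patches, question, k=17, rng=rng)
print(int(mask.sum()))  # 17 patches kept: a 576/17 ~ 34x token reduction
```

In a differentiable framework (e.g. PyTorch), the hard mask would be combined with the soft weights via a straight-through estimator so gradients flow through `soft` while the forward pass uses `mask`; the sketch above shows only the sampling and selection logic.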
Submission Number: 42