Keywords: visual grounding, cross-modal alignment priors, target-aware query generation, scale adaptability
Abstract: Visual Grounding (VG) links textual descriptions to the corresponding image regions, and its difficulty grows with the semantic complexity of the target. Existing methods encounter performance bottlenecks due to semantic alignment bias and scale-induced perception mismatch. In this paper, we propose ASVG, an efficient framework that exploits alignment priors from the cross-modal encoder to build target-aware queries and enhances scale adaptability through progressive cross-scale reasoning. First, we design an alignment prior-guided query generator, which embeds text-conditioned visual heatmaps into object queries to enhance their semantic discriminability. Second, we develop a progressive cross-scale decoder that builds a multi-resolution pyramid solely from single-scale features, enabling progressive cross-scale reasoning while avoiding redundant feature-pyramid fusion. In addition, we introduce a lightweight token branch and Soft Cross-head Distillation (SCD), which enforces feature consistency and adaptively reweights losses, reducing inference cost while maintaining high performance. Our method achieves significant performance gains across six VG and GREC datasets, particularly under complex or ambiguous target semantics.
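To make the query-generation idea concrete, below is a minimal PyTorch sketch of what an alignment prior-guided query generator could look like: a text-conditioned heatmap over single-scale visual tokens is pooled into an alignment prior, which then seeds the object queries. This is an illustration only; the function name `alignment_prior_queries`, the `proj` expansion head, and the softmax-pooling scheme are assumptions, not the paper's actual ASVG implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def alignment_prior_queries(vis_feats, txt_emb, proj, num_queries):
    """Sketch of an alignment prior-guided query generator (assumed design).

    vis_feats: (B, N, D) single-scale visual tokens from the cross-modal encoder
    txt_emb:   (B, D)    pooled text embedding
    proj:      nn.Linear(D, num_queries * D), a hypothetical expansion head
    Returns target-aware object queries of shape (B, num_queries, D).
    """
    d = vis_feats.size(-1)
    # Text-conditioned heatmap: similarity of every visual token to the text.
    heat = torch.einsum('bnd,bd->bn', vis_feats, txt_emb) / d ** 0.5
    heat = F.softmax(heat, dim=-1)                           # (B, N)
    # Heatmap-weighted pooling yields one alignment prior per image.
    prior = torch.einsum('bn,bnd->bd', heat, vis_feats)      # (B, D)
    # Expand the prior into a set of semantically discriminative queries.
    return proj(prior).view(vis_feats.size(0), num_queries, d)

# Toy usage: batch of 2 images, 196 visual tokens, 256-dim features, 10 queries.
B, N, D, Q = 2, 196, 256, 10
proj = nn.Linear(D, Q * D)
queries = alignment_prior_queries(torch.randn(B, N, D), torch.randn(B, D), proj, Q)
assert queries.shape == (B, Q, D)
```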
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 9141