Adaptive Masking Enhances Visual Grounding

ICLR 2026 Conference Submission 16645 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Interpretative Masking, Visual Grounding, Low-shot Learning
Abstract: Humans excel at recognizing objects under incomplete information by focusing on their most salient features. Inspired by this capability, we present **IMAGE** (**I**nterpretative **MA**sking with **G**aussian Radiation Mod**E**ling), a novel training paradigm for visual grounding that selectively obscures salient regions. By compelling models to infer objects from suboptimal cues, IMAGE mimics human adaptability in scenarios where critical features are absent. We propose a progressive training strategy that gradually increases the masking ratio, forcing the model to extract essential object attributes rather than memorizing all possible features. Experiments on standard visual grounding benchmarks demonstrate notable improvements in zero-shot and low-shot scenarios, and IMAGE integrates seamlessly into existing architectures. Because the method operates only during training, it adds no computational cost at deployment, offering a practical path toward robust, data-efficient visual grounding.
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 16645
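
The abstract describes two mechanisms: Gaussian modeling of salient regions and a masking ratio that grows over training. The PyTorch sketch below illustrates one plausible reading of that combination; the function names (`masking_ratio`, `gaussian_saliency_mask`), the linear schedule, and the single-Gaussian saliency model are illustrative assumptions, not the authors' implementation, which is only summarized at a high level on this page.

```python
import torch

def masking_ratio(epoch: int, total_epochs: int,
                  start: float = 0.1, end: float = 0.5) -> float:
    """Linearly anneal the masking ratio over training (assumed schedule)."""
    t = min(max(epoch / max(total_epochs - 1, 1), 0.0), 1.0)
    return start + t * (end - start)

def gaussian_saliency_mask(h: int, w: int, center: tuple, sigma: float,
                           ratio: float) -> torch.Tensor:
    """Hide the pixels closest to a salient point under a 2-D Gaussian.

    Each pixel is scored by a Gaussian centred on the salient point; the
    top `ratio` fraction of scores is masked (set to 0), so the most
    salient evidence is withheld from the model during training.
    """
    ys = torch.arange(h).float().unsqueeze(1).expand(h, w)
    xs = torch.arange(w).float().unsqueeze(0).expand(h, w)
    cy, cx = center
    dist2 = (ys - cy) ** 2 + (xs - cx) ** 2
    scores = torch.exp(-dist2 / (2 * sigma ** 2))   # saliency under the Gaussian
    k = int(ratio * h * w)
    if k == 0:
        return torch.ones(h, w)
    thresh = scores.flatten().topk(k).values.min()
    return (scores < thresh).float()                 # 0 = masked, 1 = visible

# Example: progressively hide more of the salient region as training proceeds.
image = torch.rand(3, 64, 64)
for epoch in range(0, 30, 10):
    ratio = masking_ratio(epoch, total_epochs=30)
    mask = gaussian_saliency_mask(64, 64, center=(32.0, 32.0), sigma=8.0, ratio=ratio)
    masked_image = image * mask                      # zero out the masked pixels
    print(f"epoch {epoch}: ratio {ratio:.2f}, masked pixels {(mask == 0).sum().item()}")
```

In this reading, the schedule starts with mild occlusion and ends with roughly half of the Gaussian-weighted salient area hidden, which matches the abstract's claim that the model is gradually pushed to rely on less informative cues.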