Abstract: Space grounding refers to localizing spatial references expressed in natural language instructions. Traditional methods often fail to account for complex reasoning (such as distance, geometry, and inter-object relationships), while vision-language models (VLMs), despite strong reasoning abilities, struggle to produce fine-grained outputs. To overcome these limitations, we propose C2F-Space, a novel coarse-to-fine space-grounding framework that performs coarse reasoning via propose-validate VLM prompting and then refines its predictions through superpixel-wise residual learning for precise local geometric reasoning. Our evaluations demonstrate that C2F-Space significantly outperforms three state-of-the-art baselines in both success rate and intersection-over-union on a new superpixel-level space-grounding benchmark.
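To make the two-stage pipeline in the abstract concrete, here is a minimal sketch of the coarse-to-fine flow. This is not the authors' implementation: `vlm_propose`, `vlm_validate`, and `residual_net` are hypothetical stubs standing in for the propose-validate VLM prompting and the superpixel-wise residual model, and SLIC (from scikit-image) is assumed as one plausible superpixel segmenter, which the abstract does not specify.

```python
# Sketch of a coarse-to-fine space-grounding pipeline (assumptions noted above).
import numpy as np
from skimage.segmentation import slic

def vlm_propose(image, instruction):
    """Hypothetical stub: a VLM proposes a coarse binary region mask."""
    h, w = image.shape[:2]
    mask = np.zeros((h, w), dtype=bool)
    mask[h // 4: 3 * h // 4, w // 4: 3 * w // 4] = True  # placeholder proposal
    return mask

def vlm_validate(image, instruction, mask):
    """Hypothetical stub: the VLM accepts or rejects a proposed region."""
    return mask.any()  # placeholder acceptance criterion

def residual_net(image, superpixel_id, coarse_score):
    """Hypothetical stub: a learned per-superpixel residual correction."""
    return 0.0  # placeholder residual

def c2f_space(image, instruction, max_rounds=3, n_segments=200):
    # Coarse stage: propose-validate prompting loop with the VLM.
    coarse = None
    for _ in range(max_rounds):
        proposal = vlm_propose(image, instruction)
        if vlm_validate(image, instruction, proposal):
            coarse = proposal
            break
    if coarse is None:
        coarse = proposal  # fall back to the last proposal

    # Fine stage: refine the coarse mask superpixel by superpixel.
    segments = slic(image, n_segments=n_segments, start_label=0)
    scores = np.zeros(segments.shape, dtype=float)
    for sp in np.unique(segments):
        region = segments == sp
        coarse_score = coarse[region].mean()  # fraction covered by coarse mask
        scores[region] = coarse_score + residual_net(image, sp, coarse_score)
    return scores > 0.5  # superpixel-level grounding mask

# Usage on a dummy image:
img = np.random.rand(120, 160, 3)
mask = c2f_space(img, "the space left of the mug")
```

The design point the sketch illustrates is the division of labor: the VLM handles global, instruction-level reasoning at coarse granularity, while the residual model only needs to make small local corrections per superpixel.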
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: spoken language grounding, image text matching
Languages Studied: English
Submission Number: 1202