Abstract: Space grounding refers to localizing spatial references expressed in natural language instructions. Traditional methods often fail to account for complex reasoning (such as distance, geometry, and inter-object relationships), while vision-language models (VLMs), despite strong reasoning abilities, struggle to produce fine-grained outputs. To overcome these limitations, we propose C2F-Space, a novel coarse-to-fine space-grounding framework that performs coarse reasoning via propose-validate VLM prompting and then refines its predictions through superpixel-wise residual learning for precise local geometric reasoning. Our evaluations demonstrate that C2F-Space significantly outperforms three state-of-the-art baselines in both success rate and intersection-over-union on a new superpixel-level space-grounding benchmark.
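To make the two-stage pipeline in the abstract concrete, here is a minimal sketch of the coarse-to-fine flow. This is not the authors' implementation: `vlm_propose`, `vlm_validate`, and `residual_net` are hypothetical stubs standing in for the propose-validate VLM prompting and the superpixel-wise residual model, and SLIC (from scikit-image) is assumed as one plausible superpixel segmenter, which the abstract does not specify.

```python
# Sketch of a coarse-to-fine space-grounding pipeline (assumptions noted above).
import numpy as np
from skimage.segmentation import slic

def vlm_propose(image, instruction):
    """Hypothetical stub: a VLM proposes a coarse binary region mask."""
    h, w = image.shape[:2]
    mask = np.zeros((h, w), dtype=bool)
    mask[h // 4: 3 * h // 4, w // 4: 3 * w // 4] = True  # placeholder proposal
    return mask

def vlm_validate(image, instruction, mask):
    """Hypothetical stub: the VLM accepts or rejects a proposed region."""
    return mask.any()  # placeholder acceptance criterion

def residual_net(image, superpixel_id, coarse_score):
    """Hypothetical stub: a learned per-superpixel residual correction."""
    return 0.0  # placeholder residual

def c2f_space(image, instruction, max_rounds=3, n_segments=200):
    # Coarse stage: propose-validate prompting loop with the VLM.
    coarse = None
    for _ in range(max_rounds):
        proposal = vlm_propose(image, instruction)
        if vlm_validate(image, instruction, proposal):
            coarse = proposal
            break
    if coarse is None:
        coarse = proposal  # fall back to the last proposal

    # Fine stage: refine the coarse mask superpixel by superpixel.
    segments = slic(image, n_segments=n_segments, start_label=0)
    scores = np.zeros(segments.shape, dtype=float)
    for sp in np.unique(segments):
        region = segments == sp
        coarse_score = coarse[region].mean()  # fraction covered by coarse mask
        scores[region] = coarse_score + residual_net(image, sp, coarse_score)
    return scores > 0.5  # superpixel-level grounding mask

# Usage on a dummy image:
img = np.random.rand(120, 160, 3)
mask = c2f_space(img, "the space left of the mug")
```

The design point the sketch illustrates is the division of labor: the VLM handles global, instruction-level reasoning at coarse granularity, while the residual model only needs to make small local corrections per superpixel.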
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: spoken language grounding, image text matching
Languages Studied: English
Submission Number: 1202