GRAID: Enhancing Spatial Reasoning of VLMs through High-Fidelity Data Generation

Published: 02 Mar 2026, Last Modified: 13 Mar 2026
Venue: ICLR 2026 Workshop MM Intelligence Poster
License: CC BY 4.0
Track: long paper (up to 8 pages)
Keywords: Vision Language Models, Spatial Reasoning, Synthetic Data Generation, Human Validation, Fine-tuning, Multimodal Learning
TL;DR: A framework that generates high-quality spatial reasoning VQA datasets. Human evaluations show validity rates in the 90-percent range, far above existing methods, and fine-tuning experiments show improvements over current methods on existing benchmarks.
Abstract: Vision Language Models (VLMs) achieve strong performance on many vision-language tasks but often struggle with spatial reasoning---a prerequisite for applications such as medical imaging and robotics. We present GRAID, a pipeline that generates high-fidelity spatial reasoning data from images through qualitative analysis of the 2D geometry produced by object detectors. By avoiding single-image 3D reconstruction pipelines and generative hallucinations, GRAID produces datasets with higher accuracy, as confirmed by our human study. Crucially, we demonstrate that training on GRAID-generated QA pairs teaches transferable concepts and improves reasoning on general visual reasoning problems. We fine-tune several VLM families on GRAID data and compare against models tuned on data from current methods. GRAID-tuned models achieve significant accuracy gains on both spatial reasoning and general visual reasoning benchmarks, including BLINK, A-OKVQA, and RealWorldQA. GRAID is publicly available at [our website](https://ke7.github.io/graid/).
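For intuition, here is a minimal sketch of how a spatial QA pair could be derived purely from 2D detector output, in the spirit the abstract describes. The `Detection` type, the `margin` heuristic, and the question template are our illustrative assumptions, not GRAID's actual implementation.

```python
# Minimal sketch (not the authors' implementation): derive a left/right
# QA pair from bounding-box geometry, using only 2D detector output.
from dataclasses import dataclass

@dataclass
class Detection:
    label: str                              # object class, e.g. "person"
    box: tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def center_x(det: Detection) -> float:
    """Horizontal center of a bounding box."""
    return (det.box[0] + det.box[2]) / 2.0

def left_of_question(a: Detection, b: Detection, margin: float = 10.0):
    """Emit a left/right QA pair only when the relation is unambiguous.

    The margin skips near-ties, where "left of" is ill-defined; declining
    to generate ambiguous questions is one way to keep validity high.
    """
    dx = center_x(b) - center_x(a)
    if abs(dx) < margin:
        return None  # too close to call; generate nothing
    answer = "yes" if dx > 0 else "no"
    return {
        "question": f"Is the {a.label} to the left of the {b.label}?",
        "answer": answer,
    }

# Usage with two detections from any off-the-shelf detector:
person = Detection("person", (40, 60, 120, 300))
dog = Detection("dog", (300, 200, 420, 320))
print(left_of_question(person, dog))
# -> {'question': 'Is the person to the left of the dog?', 'answer': 'yes'}
```

Because the answer is read directly off detector geometry rather than produced by a generative model, a rule of this kind cannot hallucinate objects or relations, which matches the paper's motivation for avoiding 3D reconstruction and generative pipelines.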
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 56