GRAID: Enhancing Spatial Reasoning of VLMs through High-Fidelity Data Generation

Published: 02 Mar 2026, Last Modified: 13 Mar 2026
Venue: ES-Reasoning @ ICLR 2026
License: CC BY 4.0
Keywords: Vision Language Models, Spatial Reasoning, Synthetic Data Generation, Human Validation, Fine-tuning, Multimodal Learning
TL;DR: A framework that generates high-quality spatial reasoning VQA datasets. Human evaluations show validity rates above 90%, far exceeding those of existing methods. Fine-tuning experiments show improvements over current methods on existing benchmarks.
Abstract: Vision Language Models (VLMs) achieve strong performance on many vision-language tasks but often struggle with spatial reasoning, a prerequisite for applications such as medical imaging and robotics. We present GRAID, a pipeline that generates high-fidelity spatial reasoning data from images through qualitative analysis of the 2D geometry of object-detector outputs. By avoiding single-image 3D reconstruction pipelines and hallucination-prone generative models, GRAID produces datasets with higher accuracy, as confirmed by our human study. Crucially, we demonstrate that training on GRAID-generated QA pairs teaches transferable concepts and improves reasoning on general visual reasoning problems. We fine-tune several VLM families on GRAID data and compare against models tuned on data from current methods. GRAID-tuned models achieve significant accuracy gains on both spatial reasoning and general visual reasoning benchmarks such as BLINK, A-OKVQA, and RealWorldQA. GRAID is publicly available at [our website](https://ke7.github.io/graid/).
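To make the abstract's core mechanism concrete, here is a minimal sketch of how qualitative 2D geometry over detector outputs can yield a verifiable spatial-reasoning QA pair. This is an illustrative example under assumed interfaces, not GRAID's actual code: the `Detection` type, the `left_of_qa` helper, and the ambiguity `margin` are all hypothetical.

```python
# Hypothetical sketch (not GRAID's actual API): turning labeled bounding
# boxes from an object detector into a spatial-reasoning QA pair using
# only qualitative 2D geometry -- no 3D reconstruction, no generative model.
from dataclasses import dataclass


@dataclass
class Detection:
    label: str  # object class predicted by a detector
    box: tuple  # (x_min, y_min, x_max, y_max) in pixel coordinates


def center_x(d: Detection) -> float:
    """Horizontal center of a bounding box."""
    return (d.box[0] + d.box[2]) / 2


def left_of_qa(a: Detection, b: Detection, margin: float = 10.0):
    """Emit a left/right question only when the relation is unambiguous.

    Skipping near-ties (centers within `margin` pixels) is one way a
    purely geometric check can keep generated answers reliably correct.
    """
    dx = center_x(b) - center_x(a)
    if abs(dx) < margin:
        return None  # ambiguous layout: generate no question
    return {
        "question": f"Is the {a.label} to the left of the {b.label}?",
        "answer": "yes" if dx > 0 else "no",
    }


if __name__ == "__main__":
    dog = Detection("dog", (40, 120, 160, 300))
    car = Detection("car", (400, 100, 700, 320))
    print(left_of_qa(dog, car))
    # {'question': 'Is the dog to the left of the car?', 'answer': 'yes'}
```

Because each answer is computed deterministically from box geometry, every QA pair is checkable by construction, which is consistent with the high human-validated accuracy the paper reports.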
Submission Number: 26