GRAID: Enhancing Spatial Reasoning of VLMs through High-Fidelity Data Generation

Published: 02 Mar 2026, Last Modified: 13 Mar 2026
Venue: ICLR 2026 Workshop MM Intelligence Poster
License: CC BY 4.0
Track: long paper (up to 8 pages)
Keywords: Vision Language Models, Spatial Reasoning, Synthetic Data Generation, Human Validation, Fine-tuning, Multimodal Learning
TL;DR: A framework that generates high-quality spatial reasoning VQA datasets. Human evaluations show validity rates in the 90-percent range, far above existing methods, and fine-tuning experiments show improvements over current methods on existing benchmarks.
Abstract: Vision Language Models (VLMs) achieve strong performance on many vision-language tasks but often struggle with spatial reasoning---a prerequisite for applications such as medical imaging and robotics. We present GRAID, a pipeline that generates high-fidelity spatial reasoning data from images through qualitative analysis of the 2D geometry produced by object detectors. By avoiding single-image 3D reconstruction pipelines and generative hallucinations, GRAID produces datasets with higher accuracy, as confirmed by our human study. Crucially, we demonstrate that training on GRAID-generated QA pairs teaches transferable concepts and improves reasoning on general visual reasoning problems. We fine-tune several VLM families on GRAID data and compare against models tuned on data from current methods. GRAID-tuned models achieve significant accuracy gains on both spatial reasoning and general visual reasoning benchmarks, including BLINK, A-OKVQA, and RealWorldQA. GRAID is publicly available at [our website](https://ke7.github.io/graid/).
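For intuition, here is a minimal sketch of how a spatial QA pair could be derived purely from 2D detector output, in the spirit the abstract describes. The `Detection` type, the `margin` heuristic, and the question template are our illustrative assumptions, not GRAID's actual implementation.

```python
# Minimal sketch (not the authors' implementation): derive a left/right
# QA pair from bounding-box geometry, using only 2D detector output.
from dataclasses import dataclass

@dataclass
class Detection:
    label: str                              # object class, e.g. "person"
    box: tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def center_x(det: Detection) -> float:
    """Horizontal center of a bounding box."""
    return (det.box[0] + det.box[2]) / 2.0

def left_of_question(a: Detection, b: Detection, margin: float = 10.0):
    """Emit a left/right QA pair only when the relation is unambiguous.

    The margin skips near-ties, where "left of" is ill-defined; declining
    to generate ambiguous questions is one way to keep validity high.
    """
    dx = center_x(b) - center_x(a)
    if abs(dx) < margin:
        return None  # too close to call; generate nothing
    answer = "yes" if dx > 0 else "no"
    return {
        "question": f"Is the {a.label} to the left of the {b.label}?",
        "answer": answer,
    }

# Usage with two detections from any off-the-shelf detector:
person = Detection("person", (40, 60, 120, 300))
dog = Detection("dog", (300, 200, 420, 320))
print(left_of_question(person, dog))
# -> {'question': 'Is the person to the left of the dog?', 'answer': 'yes'}
```

Because the answer is read directly off detector geometry rather than produced by a generative model, a rule of this kind cannot hallucinate objects or relations, which matches the paper's motivation for avoiding 3D reconstruction and generative pipelines.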
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 56