SpintBench: Evaluating LLMs' Complex Reasoning via Spatial Integration Challenges

ICLR 2026 Conference Submission 15706 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: complex reasoning, spatial integration, LLMs, benchmark, prompting
TL;DR: This paper introduces SpintBench, a benchmark for evaluating spatial reasoning in LLMs, showing that they struggle with 2D and 3D spatial integration despite modest improvements from few-shot and CoT prompting.
Abstract: Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains, yet their comprehensive spatial reasoning competencies remain underexplored. This paper proposes a benchmark construction framework for evaluating spatial reasoning in both 2D and 3D spaces, one that requires LLMs to infer global information from provided local details through spatial integration. Specifically, we design rules to automatically generate spatial descriptions of local scenes with overlapping cues, together with corresponding question-answer (QA) pairs, forming the spatial integration reasoning benchmark SpintBench. Experimental results show that state-of-the-art (SOTA) LLMs still struggle to tackle SpintBench effectively: while the combination of few-shot learning and chain-of-thought (CoT) prompting improves performance, the gains remain modest. We expect this work to provide valuable insights for advancing research on the spatial reasoning capabilities of LLMs.
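The abstract does not spell out the generation rules, so the following is only a minimal illustrative sketch of what an automatically generated 2D spatial-integration item could look like: objects placed on a grid, overlapping local patches described independently, and a global relative-position question whose answer generally requires chaining cues across patches. All names and parameters here (grid size, patch size, overlap, object labels) are hypothetical and not taken from the paper.

```python
import random

def generate_task(grid=5, patch=3, overlap=1, n_objects=5, seed=0):
    """Illustrative 2D spatial-integration task: place objects on a grid,
    describe overlapping local patches, and ask a global relative-position
    question that a single patch typically cannot answer on its own."""
    rng = random.Random(seed)
    cells = rng.sample([(x, y) for x in range(grid) for y in range(grid)], n_objects)
    objects = {chr(ord("A") + i): pos for i, pos in enumerate(cells)}

    # Overlapping local patches (stride = patch - overlap).
    stride = patch - overlap
    descriptions = []
    for px in range(0, grid - patch + 1, stride):
        for py in range(0, grid - patch + 1, stride):
            local = sorted((o, xy) for o, xy in objects.items()
                           if px <= xy[0] < px + patch and py <= xy[1] < py + patch)
            facts = []
            for i in range(len(local)):
                for j in range(i + 1, len(local)):
                    (a, (ax, ay)), (b, (bx, by)) = local[i], local[j]
                    facts.append(f"{b} is {bx - ax} east and {by - ay} north of {a}")
            if facts:
                descriptions.append("Local scene: " + "; ".join(facts) + ".")

    # Global question: relative offset between two objects, which in general
    # must be inferred by chaining relations through objects shared by
    # overlapping patches (a full generator would also check connectivity).
    a, b = rng.sample(sorted(objects), 2)
    (ax, ay), (bx, by) = objects[a], objects[b]
    question = f"Relative to {a}, how many cells east and north is {b}?"
    answer = (bx - ax, by - ay)
    return descriptions, question, answer

descs, q, ans = generate_task()
print("\n".join(descs))
print(q, "->", ans)
```

Because both the local descriptions and the question are derived from the same ground-truth layout, the gold answer is available for free, which is what makes this kind of benchmark fully automatic to construct and score.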
Primary Area: datasets and benchmarks
Submission Number: 15706