Keywords: Large Language Models, Spatial Ability, Cognitive Benchmark
TL;DR: We introduce HST-bench, a theory-driven benchmark grounded in cognitive science to systematically evaluate LLMs’ spatial thinking abilities.
Abstract: Large language models (LLMs) show strong potential for real-world applications, yet their deployment in domains requiring deep interaction with the physical world hinges on robust spatial ability. Existing evaluations are constrained by a flawed, task-driven paradigm that probes only surface-level perception and lacks the cognitive depth and theoretical guidance needed for genuine diagnostic precision.
To address this, we introduce HST-bench, a benchmark for Hierarchical Spatial Thinking that initiates a paradigm shift toward theory-driven evaluation. Grounded in the National Research Council’s theory, HST-bench organizes assessment along three core cognitive dimensions: Representational Perception, Representational Transformation, and Spatial Reasoning. Spanning 1,629 problems across 10 sub-dimensions, our tasks require dynamic operations such as coordinate transformation and symmetry, demanding deep spatial representation and reasoning. Comprehensive evaluations reveal that a “thinking” mechanism is critical for advanced spatial tasks. We further observe a strong positive correlation between general and spatial capabilities and, importantly, only limited gains from multimodal inputs, highlighting the current primacy of reasoning over perception. HST-bench offers a principled, cognitively grounded path toward diagnosing and advancing the spatial intelligence of large models.
Primary Area: datasets and benchmarks
Submission Number: 9365