Keywords: Human cognition, AI reasoning, spatial cognition, Tangram, vision-language models, test-time refinement, cognitive modeling
TL;DR: Inspired by how humans solve Tangram puzzles, we build a spatial cognition benchmark showing a large human–VLM gap in continuous geometric reasoning, and substantially close this gap via a human-inspired iterative feedback refinement loop.
Abstract: Humans excel at spatial reasoning tasks such as Tangram puzzle assembly through cognitive processes involving mental rotation, iterative refinement, and visual feedback. In contrast, Vision–Language Models (VLMs) struggle with continuous geometric reasoning despite their success on discrete benchmarks. This paper bridges human cognition and AI reasoning by introducing a framework that explicitly models human spatial cognitive capabilities and incorporates them into AI reasoning processes. We propose a human-inspired test-time refinement framework that mimics how humans iteratively correct spatial predictions through feedback-guided adjustments, combining in-context learning with reward-guided feedback loops. Experiments reveal that current VLMs achieve only 0.41 IoU on single-piece tasks, dropping to 0.23 on two-piece composition, far below human performance (≈0.98–1.00). Our human-inspired verifier–refiner agent applies reward-guided refinement loops that model human iterative correction, raising IoU from 0.63 to 0.932 without any model retraining. This demonstrates that incorporating explicit models of human cognitive capabilities can substantially enhance AI reasoning in continuous spatial domains.
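To make the IoU metric and the reward-guided refinement loop mentioned in the abstract concrete, here is a minimal sketch. It is not the authors' verifier–refiner agent: the VLM refiner is replaced by a hypothetical greedy local search, the piece is assumed to be an axis-aligned square placed by translation only, and the names `rasterize`, `iou`, and `refine` are illustrative.

```python
from itertools import product


def rasterize(x, y, size, grid=32):
    """Return the set of grid cells covered by an axis-aligned square.

    Assumption: a square stands in for a Tangram piece; rotation is ignored.
    """
    return {
        (i, j)
        for i in range(x, x + size)
        for j in range(y, y + size)
        if 0 <= i < grid and 0 <= j < grid
    }


def iou(a, b):
    """Intersection over union of two cell sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def refine(pred, target, size, max_steps=50):
    """Greedy reward-guided refinement: accept a move only if IoU improves.

    Assumption: the 'verifier' is the IoU score against the target mask and
    the 'refiner' proposes one-cell shifts. This stands in for a model-driven
    correction step and needs some initial overlap to make progress.
    """
    target_mask = rasterize(*target, size)
    best = iou(rasterize(*pred, size), target_mask)
    history = [best]
    for _ in range(max_steps):
        # Propose the eight neighbouring placements.
        candidates = [
            (pred[0] + dx, pred[1] + dy)
            for dx, dy in product((-1, 0, 1), repeat=2)
            if (dx, dy) != (0, 0)
        ]
        score, cand = max(
            (iou(rasterize(*c, size), target_mask), c) for c in candidates
        )
        if score <= best:
            break  # no proposal improves the reward, so stop refining
        pred, best = cand, score
        history.append(best)
    return pred, history


if __name__ == "__main__":
    final, scores = refine((5, 5), (10, 12), size=8)
    print(final, [round(s, 3) for s in scores])
```

In this toy setting the IoU history rises monotonically as each accepted shift increases overlap with the target, which mirrors the test-time, retraining-free character of the loop described above; the actual framework presumably also handles rotation and multi-piece composition.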
Paper Type: New Short Paper
Supplementary Material: zip
Submission Number: 12