Keywords: Vision–Language Models, continuous geometric reasoning, Tangram, test-time self-improvement, reward-guided refinement, spatial reasoning
TL;DR: Current vision–language models lag far behind humans on continuous Tangram geometric reasoning, and even with reward-guided test-time self-improvement they still cannot close the continuous-space gap.
Abstract: This paper presents a surprising negative result: despite their success in discrete reasoning, Vision–Language Models (VLMs) fail catastrophically at continuous geometric reasoning, achieving only 0.41 IoU on single-piece tangram tasks and 0.23 on two-piece composition, far below human performance. Humans can complete tangram tasks even in childhood, demonstrating a high level of continuous spatial reasoning ability (Bohning & Althouse, 1997). Comprehensive experiments across state-of-the-art VLMs (GPT-4o, Gemini, Claude, Qwen, LLaMA) show that while test-time self-improvement through reward-guided refinement loops does improve predictions (0.63→0.93 IoU on single-piece cases), this refinement is far from sufficient to close the gap: even after self-improvement, performance remains below human level, gains do not reliably generalize, and multi-piece tasks pose even greater challenges. Our negative result therefore targets VLMs' continuous-space reasoning ability, not the value of test-time refinement itself. We posit five underlying limitations: training-distribution mismatch; output-format constraints that treat coordinates as text strings; geometric invariance in visual encoders; precision limits of positional embeddings; and the absence of geometry-aware feedback and inductive biases. We also document boundary conditions under which refinement helps (single-piece tasks) but saturates within 6 iterations, indicating systematic rather than correctable errors.
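To make the evaluation and refinement setup concrete, the following is a minimal sketch of an IoU reward on discretized piece placements and a reward-guided refinement loop that keeps the best-scoring proposal and stops after a fixed budget of 6 iterations. This is an illustration under assumptions, not the paper's implementation: `make_toy_proposer` is a hypothetical stand-in for querying a VLM, and pieces are represented as sets of occupied grid cells.

```python
def iou(pred: set, target: set) -> float:
    """Intersection-over-Union between two piece placements,
    each given as a set of occupied (row, col) grid cells."""
    union = len(pred | target)
    return len(pred & target) / union if union else 1.0

def refine(propose, target, max_iters=6):
    """Reward-guided test-time refinement: score each proposal with IoU,
    feed the best score back to the proposer, keep the best placement,
    and stop after a fixed iteration budget (6, where gains saturate)."""
    best, best_score = None, -1.0
    for _ in range(max_iters):
        cand = propose(best, best_score)  # model proposes given feedback
        score = iou(cand, target)
        if score > best_score:
            best, best_score = cand, score
    return best, best_score

def make_toy_proposer():
    """Hypothetical stand-in for a VLM placement call: starts offset
    by 3 cells and nudges the piece one cell toward the target each round."""
    state = {"shift": 3}
    def propose(prev, score):
        s = state["shift"]
        state["shift"] = max(0, s - 1)
        return {(r + s, c) for r in range(2, 6) for c in range(2, 6)}
    return propose

# Toy usage: a 4x4 square target; the proposer converges within the budget.
target = {(r, c) for r in range(2, 6) for c in range(2, 6)}
best, score = refine(make_toy_proposer(), target)
```

In this toy setting the loop reaches a perfect placement; the paper's point is that real VLM proposals improve under this scheme yet still plateau below human performance.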
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 23