Keywords: Vision-language models, self-improving AI, test-time self-improvement, continuous geometric reasoning, Tangram benchmark, spatial reasoning
TL;DR: A Tangram-based benchmark and recursive test-time refinement framework that reveal and mitigate severe failures of vision–language models in continuous geometric reasoning without any retraining.
Abstract: Vision–Language Models (VLMs) have achieved remarkable success on discrete
multimodal benchmarks, yet struggle with continuous geometric reasoning tasks
that require precise spatial alignment. This paper addresses a fundamental challenge
in self-improving AI: how can models iteratively refine their predictions at
test time without parameter updates? We introduce a test-time self-refinement
framework that combines in-context learning with reward-guided feedback loops
to enable VLMs to improve geometric alignment through iterative corrections.
Our approach operates on Tangram puzzle assembly, a mathematically rigorous,
NP-hard shape arrangement task requiring precise estimation of position, rotation,
and scale. We establish a continuous-space evaluation benchmark that decomposes
geometric reasoning into factorized subtasks (position, angle, size) and measures
performance using ℓ2 distance and polygonal intersection-over-union (IoU).
Comprehensive experiments across five representative VLMs reveal systematic
performance gaps (average IoU 0.41 on single-piece tasks, dropping to 0.23 on
two-piece composition). Our training-free verifier–refiner agent applies recursive
refinement loops that update predictions based on geometric
consistency feedback. Starting from initial predictions with low IoU (0.63), the
recursive loop progressively improves geometric alignment over multiple iterations,
achieving an IoU of 0.932 on medium-triangle cases without any model
retraining. This demonstrates that recursive self-improvement can substantially
enhance geometric reasoning in VLMs, moving self-improving AI from promise
to practice in continuous spatial domains.
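
The two evaluation metrics and the verifier–refiner loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`polygon_iou`, `position_error`, `refine`, `nudge_position`) is hypothetical, the clipping routine assumes convex polygons with counter-clockwise vertices, the reward is assumed to be IoU against a known target silhouette, and the proposer is a deterministic stand-in for the VLM that only nudges position.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]                     # convex, counter-clockwise vertices
Pose = Tuple[float, float, float, float]  # (x, y, theta in radians, scale)


def area(poly: Polygon) -> float:
    """Shoelace formula; positive for counter-clockwise vertex order."""
    n = len(poly)
    return 0.5 * sum(poly[i][0] * poly[(i + 1) % n][1]
                     - poly[(i + 1) % n][0] * poly[i][1] for i in range(n))


def clip(subject: Polygon, clipper: Polygon) -> Polygon:
    """Sutherland-Hodgman clipping of `subject` against a convex `clipper`."""
    output = list(subject)
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        if not output:
            break
        inp, output = output, []
        # Signed side test: >= 0 means the point is inside the edge a -> b.
        side = lambda p: (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
        for j in range(len(inp)):
            p, q = inp[j], inp[(j + 1) % len(inp)]
            sp, sq = side(p), side(q)
            if sp >= 0:
                output.append(p)
            if (sp > 0 and sq < 0) or (sp < 0 and sq > 0):
                t = sp / (sp - sq)  # where segment p -> q crosses the edge
                output.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
    return output


def polygon_iou(p: Polygon, q: Polygon) -> float:
    """Polygonal intersection-over-union for two convex polygons."""
    inter_poly = clip(p, q)
    inter = abs(area(inter_poly)) if len(inter_poly) >= 3 else 0.0
    union = abs(area(p)) + abs(area(q)) - inter
    return inter / union if union > 0 else 0.0


def position_error(pred: Pose, truth: Pose) -> float:
    """L2 distance between predicted and reference positions."""
    return math.hypot(pred[0] - truth[0], pred[1] - truth[1])


def transform(base: Polygon, pose: Pose) -> Polygon:
    """Rotate by theta, scale, then translate the base piece."""
    x, y, theta, s = pose
    c, sn = math.cos(theta), math.sin(theta)
    return [(x + s * (c * px - sn * py), y + s * (sn * px + c * py)) for px, py in base]


def nudge_position(pose: Pose, step: float = 0.1) -> List[Pose]:
    """Stand-in proposer (assumption): try small moves along x and y only."""
    x, y, theta, s = pose
    return [(x + step, y, theta, s), (x - step, y, theta, s),
            (x, y + step, theta, s), (x, y - step, theta, s)]


def refine(base: Polygon, target: Polygon, pose: Pose,
           propose: Callable[[Pose], List[Pose]], max_iters: int = 20) -> Tuple[Pose, float]:
    """Verifier-refiner loop: keep the best-scoring proposal while it improves."""
    best = polygon_iou(transform(base, pose), target)
    for _ in range(max_iters):
        scored = [(polygon_iou(transform(base, c), target), c) for c in propose(pose)]
        cand_score, cand = max(scored, key=lambda sc: sc[0])
        if cand_score <= best:  # verifier sees no improvement: stop
            break
        pose, best = cand, cand_score
    return pose, best
```

In a setting closer to the one described, `propose` would wrap a VLM call that receives the current pose and its reward as in-context feedback, and rotation and scale would be refined alongside position; the loop structure itself would stay the same.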
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 19