TangramSR: A Benchmark for Recursive Self-Improvement In Continuous Geometric Reasoning

Published: 05 Mar 2026, Last Modified: 09 Mar 2026 | ICLR 2026 Workshop RSI Poster | CC BY 4.0
Keywords: Vision-language models, self-improving AI, test-time self-improvement, continuous geometric reasoning, Tangram benchmark, spatial reasoning
TL;DR: A Tangram-based benchmark and recursive test-time refinement framework that reveal and mitigate severe failures of vision–language models in continuous geometric reasoning without any retraining.
Abstract: Vision–Language Models (VLMs) have achieved remarkable success on discrete multimodal benchmarks, yet struggle with continuous geometric reasoning tasks that require precise spatial alignment. This paper addresses a fundamental challenge in self-improving AI: how can models iteratively refine their predictions at test time without parameter updates? We introduce a test-time self-refinement framework that combines in-context learning with reward-guided feedback loops to enable VLMs to improve geometric alignment through iterative corrections. Our approach operates on Tangram puzzle assembly, a mathematically rigorous, NP-hard shape arrangement task requiring precise estimation of position, rotation, and scale. We establish a continuous-space evaluation benchmark that decomposes geometric reasoning into factorized subtasks (position, angle, size) and measures performance using ℓ2 distance and polygonal intersection-over-union (IoU). Comprehensive experiments across five representative VLMs reveal systematic performance gaps (average IoU 0.41 on single-piece tasks, dropping to 0.23 on two-piece composition). Our training-free verifier–refiner agent applies recursive refinement loops that iteratively self-refine predictions based on geometric consistency feedback. Starting from initial predictions with low IoU (0.63), the recursive loop progressively improves geometric alignment through multiple iterations, achieving IoU of 0.932 on medium-triangle cases without any model retraining. This demonstrates that recursive self-improvement can substantially enhance geometric reasoning in VLMs, moving self-improving AI from promise to practice in continuous spatial domains.
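Illustrative sketch (not the authors' code): the abstract names polygonal IoU as an evaluation metric and describes a loop that refines a predicted pose using geometric feedback. The Python below shows one way these two ideas could be realized for convex pieces. All names (`polygon_iou`, `place`, `refine`, `UNIT_TRI`) are hypothetical, the intersection uses Sutherland-Hodgman clipping as an assumed choice, and the refiner is a plain greedy coordinate search standing in for the VLM-based verifier–refiner agent, whose actual prompts and feedback format are not given here. The ℓ2 position error mentioned in the abstract can be computed with the standard-library `math.dist`.

```python
import math

# Hypothetical template: a right triangle with counter-clockwise vertices.
UNIT_TRI = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

def shoelace_area(poly):
    """Signed area of a polygon [(x, y), ...]; positive if counter-clockwise."""
    return 0.5 * sum(x1 * y2 - x2 * y1
                     for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]))

def clip_convex(subject, clipper):
    """Sutherland-Hodgman clipping of `subject` by a convex, counter-clockwise `clipper`."""
    def inside(p, a, b):
        # True if p lies on or to the left of the directed edge a -> b.
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def intersect(p, q, a, b):
        # Intersection of line p-q with line a-b (only called when they cross).
        dx1, dy1 = q[0] - p[0], q[1] - p[1]
        dx2, dy2 = b[0] - a[0], b[1] - a[1]
        t = ((a[0] - p[0]) * dy2 - (a[1] - p[1]) * dx2) / (dx1 * dy2 - dy1 * dx2)
        return (p[0] + t * dx1, p[1] + t * dy1)

    output = list(subject)
    for a, b in zip(clipper, clipper[1:] + clipper[:1]):
        if not output:
            break
        current, output = output, []
        for p, q in zip(current, current[1:] + current[:1]):
            if inside(q, a, b):
                if not inside(p, a, b):
                    output.append(intersect(p, q, a, b))
                output.append(q)
            elif inside(p, a, b):
                output.append(intersect(p, q, a, b))
    return output

def polygon_iou(p, q):
    """Intersection-over-union of two convex, counter-clockwise polygons."""
    inter = abs(shoelace_area(clip_convex(p, q)))
    union = abs(shoelace_area(p)) + abs(shoelace_area(q)) - inter
    return inter / union if union > 0 else 0.0

def place(template, x, y, theta, scale=1.0):
    """Rotate by theta (radians), scale, then translate the template to (x, y)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(x + scale * (c * vx - s * vy), y + scale * (s * vx + c * vy))
            for vx, vy in template]

def refine(template, target_poly, pose, steps=(0.1, 0.1, math.radians(10)), iters=30):
    """Toy refinement loop: greedy coordinate search over (x, y, theta).

    Assumption: this search replaces the VLM refiner; the IoU check plays the
    role of the verifier. A candidate is kept only if it raises the IoU.
    """
    score = polygon_iou(place(template, *pose), target_poly)
    steps = list(steps)
    for _ in range(iters):
        improved = False
        for i in range(3):
            for sign in (1, -1):
                cand = list(pose)
                cand[i] += sign * steps[i]
                cand_score = polygon_iou(place(template, *cand), target_poly)
                if cand_score > score:
                    pose, score, improved = tuple(cand), cand_score, True
        if not improved:
            steps = [step / 2 for step in steps]  # shrink the search radius
    return pose, score
```

For example, `refine(UNIT_TRI, place(UNIT_TRI, 0.5, 0.5, 0.0), (0.3, 0.6, 0.2))` returns a pose whose IoU against the target is never lower than that of the starting pose, because candidates are accepted only when the score increases. Scale is held fixed in this sketch; the benchmark's size subtask would add a fourth search dimension.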
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 19