Keywords: visual programming, spatial reasoning, tool abstraction
Abstract: The composition of specialized tools offers a powerful approach for complex visual reasoning, particularly for tasks involving 3D spatial understanding. However, existing visual programming methods are often constrained by fixed toolsets or offline tool induction, which leads to suboptimal solutions and poor tool reuse. We introduce Transductive Visual Programming (TVP), a novel framework that dynamically evolves a library of reusable tools by learning from its problem-solving experience.
TVP abstracts recurring solution patterns into new, higher-level tools, which are then used to construct simpler and more effective programs for new tasks. On the challenging Omni3D-Bench, TVP establishes a new state of the art, outperforming both specialized vision-language models and prior visual programming systems. The evolved tools also exhibit strong generalization to out-of-domain queries on 3DSRBench, SpatialSense, and VGBench. Our work demonstrates that transductive tool evolution is a powerful and generalizable paradigm for building robust visual reasoning systems.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 23479
Loading