Abstract: Visual programming, a modular and generalizable paradigm, integrates different modules and Python operators to solve various vision-language tasks. Unlike end-to-end models that require task-specific training data, it performs visual processing and reasoning without supervision. However, current visual programming methods generate programs in a single pass for each task and lack the ability to evaluate and optimize them based on feedback, which consequently limits their effectiveness on complex, multi-step problems. Drawing inspiration from Benders decomposition, we introduce De-fine, a training-free framework that automatically decomposes complex tasks into simpler subtasks and refines programs through auto-feedback. This model-agnostic approach improves logical reasoning performance by integrating the strengths of multiple models. Experiments across a variety of visual tasks show that De-fine generates more accurate and robust programs, setting a new state of the art in the field. The anonymous project is available at https://anonymous.4open.science/r/De-fine_Program-FE15
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: We revisit visual programming as a task of modular programming and feedback-driven optimization, approaching it with software-engineering principles. With De-fine, we break tasks down into executable program blocks and refine them automatically using multifaceted feedback.
De-fine constructs an abstract logical prompt that preserves the internal logical reasoning structure of the draft program, and it systematically defines four types of feedback to optimize program quality and performance.
Without any supervised training data, De-fine achieves state-of-the-art zero-shot performance on tasks such as image question answering, visual reasoning, and visual grounding.
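A minimal, illustrative sketch of the decompose-then-refine loop described above. This is an assumption-laden outline, not the paper's actual implementation: `llm_generate`, `execute_program`, and `collect_feedback` are hypothetical placeholders standing in for an LLM backend, a program executor, and the four feedback types.

```python
# Hypothetical sketch of a De-fine-style decompose-and-refine loop.
# All three callables below are placeholders, not the authors' API.
from typing import Callable, List


def decompose(task: str, llm_generate: Callable[[str], str]) -> List[str]:
    """Ask the LLM to split a complex task into simpler subtasks."""
    response = llm_generate(f"Decompose into numbered subtasks: {task}")
    return [line.strip() for line in response.splitlines() if line.strip()]


def refine(program: str, feedback: List[str],
           llm_generate: Callable[[str], str]) -> str:
    """Regenerate the program conditioned on multifaceted feedback."""
    prompt = ("Refine this program given the feedback.\n"
              f"Program:\n{program}\nFeedback:\n" + "\n".join(feedback))
    return llm_generate(prompt)


def define_loop(task: str,
                llm_generate: Callable[[str], str],
                execute_program: Callable[[str], object],
                collect_feedback: Callable[[str, object], List[str]],
                max_rounds: int = 3) -> str:
    """Decompose a task, draft a program, then refine it via auto-feedback."""
    subtasks = decompose(task, llm_generate)
    program = llm_generate(
        "Write a program for these subtasks:\n" + "\n".join(subtasks))
    for _ in range(max_rounds):
        result = execute_program(program)
        # Placeholder for the four feedback types the framework defines.
        feedback = collect_feedback(program, result)
        if not feedback:  # no issues reported: accept the program
            break
        program = refine(program, feedback, llm_generate)
    return program
```

The loop terminates either when the feedback collector reports no remaining issues or after a fixed refinement budget, mirroring the training-free, feedback-driven optimization the submission describes.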
Supplementary Material: zip
Submission Number: 2472