Abstract: In Visual Question Answering (VQA), program-driven reasoning methods have advanced by translating the solution to a question into executable code. However, existing approaches often struggle because they rely on a single round of code generation, which cannot adapt to unforeseen errors. To address this challenge, we introduce Seper, a Self-Enhancing Programming-driven Reasoning framework for VQA. Seper uses large language models (LLMs) to decompose questions into multi-step instructions and a code generator to produce Python code dynamically. It also incorporates a code evaluator that performs both forward and backward evaluations and triggers iterative code regeneration for continuous optimization. Additionally, we introduce prompt tuning to improve the quality of the generated code. Experiments on the GQA and OK-VQA datasets show that Seper outperforms existing methods, demonstrating its potential to advance program-driven approaches to VQA. Code: https://anonymous.4open.science/r/Seper-5540/
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: code generation and understanding; multimodal applications
Contribution Types: NLP engineering experiment, Reproduction study, Data analysis
Languages Studied: English
Submission Number: 2014