Abstract: Visual reasoning is a key capability that strongly affects performance on multimodal tasks such as compositional visual question answering and visual grounding, which often require complex, multi-step reasoning. In recent years, several training-free methods for Vision-Language Models (VLMs) have emerged, among them visual programming methods designed to strengthen VLMs on visual reasoning tasks. Despite this progress, these methods lack mechanisms to verify and refine the output of each action during reasoning, and therefore face two primary challenges: error accumulation with delayed feedback, and insufficient use of multimodal contextual information. To address these challenges, we propose VL-DynaRefine, a training-free approach consisting of three modules: a planner, a verifier, and a refiner. The planner generates programmatic actions to solve the problem and executes them in sequence; the verifier reassesses each action's output via confidence scores and decides whether refinement is necessary. The refiner combines a context-aware local refinement mechanism with a global refinement mechanism based on visual and action trajectories, reducing the impact of reasoning errors on the final outcome. We evaluate our approach on multiple visual reasoning datasets; the results show that it outperforms existing visual programming methods in both reasoning accuracy and efficiency, validating its effectiveness on visual reasoning tasks.
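The planner-verifier-refiner pipeline described in the abstract can be read as a verify-then-refine loop over executed actions. The Python sketch below is illustrative only, not the paper's implementation: all identifiers (plan_actions, score_confidence, refine_local, refine_global, THRESHOLD, MAX_GLOBAL_RETRIES) are hypothetical stand-ins, and the control flow is one plausible reading of the abstract under those assumptions.

```python
# Hypothetical sketch of a verify-then-refine loop in the spirit of
# VL-DynaRefine's abstract. All names and thresholds are assumptions,
# not the paper's actual API.

THRESHOLD = 0.7          # assumed confidence cutoff that triggers refinement
MAX_GLOBAL_RETRIES = 2   # assumed bound on global re-planning attempts

def solve(question, image, planner, verifier, refiner):
    """Execute planned actions step by step, verifying each output
    and refining locally or globally when confidence is low."""
    for _ in range(MAX_GLOBAL_RETRIES + 1):
        actions = planner.plan_actions(question, image)  # programmatic action plan
        trajectory = []                                  # (action, output) history
        plan_ok = True
        for action in actions:
            output = action.execute(image, trajectory)
            confidence = verifier.score_confidence(action, output, question)
            if confidence < THRESHOLD:
                # Context-aware local refinement: retry this action using
                # multimodal context accumulated in the trajectory so far.
                output, confidence = refiner.refine_local(
                    action, output, trajectory, image)
            if confidence < THRESHOLD:
                # Local fix failed: fall back to global refinement, i.e.
                # re-plan using the visual and action trajectories.
                planner = refiner.refine_global(
                    planner, trajectory, question, image)
                plan_ok = False
                break
            trajectory.append((action, output))
        if plan_ok and trajectory:
            return trajectory[-1][1]  # final action's output is the answer
    return None  # unresolved after the retry budget is exhausted
```

In this reading, local refinement repairs a single low-confidence step cheaply, while global refinement re-plans the whole action sequence only when a local repair fails, which is one way to obtain the accuracy and efficiency gains the abstract claims.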
External IDs: DOI: 10.1145/3746027.3755296