Integrating Vision-Tool to Enhance Visual-Question-Answering in Special Domains

Published: 01 Jan 2024, Last Modified: 20 May 2025 · PRICAI (3) 2024 · CC BY-SA 4.0
Abstract: Visual-Question-Answering (VQA) requires answering questions about visual information. Although pre-trained Vision-Language Models (VLMs) have achieved promising results on various VQA benchmarks, they show limitations when adapted to VQA in special domains, which demand specific vision and reasoning skills. While Large Language Models (LLMs) possess outstanding knowledge and reasoning skills, they cannot be directly applied to VQA because they lack vision support. We introduce a framework that enhances the performance of VLMs and enables the use of LLMs in special-domain VQA. The framework leverages computer vision (CV) tools and pre-defined tool recipes to provide the models with the information necessary to solve the task. Along with the framework, we introduce three tool recipes for special VQA domains: (i) Visual Puzzle, (ii) Visual Arithmetic Reasoning, and (iii) Multilingual Scene-text. Experiments show that the proposed framework and tool recipes significantly outperform competitive VLMs on various tasks in both fine-tuning and few-shot settings, establishing new state-of-the-art results.
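The abstract's core idea of "tool recipes" can be sketched as follows: a recipe is an ordered list of CV tool calls whose textual outputs are packed into the prompt given to a VLM or LLM. This is a minimal illustrative sketch; every name, interface, and recipe below is a hypothetical stand-in, not the authors' actual API or tool set.

```python
# Hypothetical sketch of the tool-recipe idea: run a domain's CV tools,
# serialize their observations, and build a prompt for an LLM/VLM.
# All tool implementations here are placeholder stubs.

from typing import Callable, Dict, List

# A tool maps an image (here a placeholder dict) to a text observation.
Tool = Callable[[dict], str]

def ocr_tool(image: dict) -> str:
    # Stand-in for a real OCR / scene-text model.
    return "detected text: " + image.get("text", "<none>")

def object_detector(image: dict) -> str:
    # Stand-in for a real object-detection model.
    return "objects: " + ", ".join(image.get("objects", []))

# A recipe names which tools to run for a given special domain
# (recipe names are illustrative, loosely following the paper's domains).
RECIPES: Dict[str, List[Tool]] = {
    "multilingual_scene_text": [ocr_tool],
    "visual_arithmetic": [ocr_tool, object_detector],
}

def build_prompt(domain: str, image: dict, question: str) -> str:
    """Run the domain's recipe and pack observations into a text prompt."""
    observations = [tool(image) for tool in RECIPES[domain]]
    context = "\n".join(observations)
    return f"Context from vision tools:\n{context}\n\nQuestion: {question}"

if __name__ == "__main__":
    image = {"text": "3 + 4", "objects": ["blackboard"]}
    print(build_prompt("visual_arithmetic", image, "What is the sum?"))
```

The design point is that the language model never sees pixels: the recipe converts visual content into text observations, letting a text-only LLM contribute its reasoning skills to the VQA task.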