GViG: Generative Visual Grounding Using Prompt-Based Language Modeling for Visual Question Answering

Published: 01 Jan 2024 · Last Modified: 20 Feb 2025 · PAKDD (6) 2024 · CC BY-SA 4.0
Abstract: The WSDM 2023 Toloka VQA challenge introduces a new Grounding-based Visual Question Answering (GVQA) dataset, raising the complexity of multimodal tasks. The challenge diverges from traditional VQA by requiring models to identify a bounding box in response to an image-question pair, aligning it with Visual Grounding (VG) tasks. Existing VG approaches, when applied to GVQA, often require external data or larger models to achieve satisfactory results, leading to high computational demands. We instead approach the task as a language modeling problem, applying prompt tuning to multiple state-of-the-art VQA models. Our method, operating solely on a single NVIDIA RTX 3090 GPU and without external data, secured third place in the challenge with an Intersection over Union (IoU) of 75.658. Notably, our model provides explainability between textual and visual data through its attention mechanism, offering insights into its decision-making process. This research demonstrates that high performance in GVQA can be achieved with minimal resources, enhancing understanding of model dynamics and paving the way for improved interpretability and efficiency. Our code is available here: https://github.com/IKMLab/GViG.git
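For concreteness, below is a minimal sketch of the IoU metric referenced above, under the assumption that the generative model's answer can be parsed into corner coordinates (x1, y1, x2, y2). The coordinate-string format, helper names, and example values are illustrative assumptions, not the paper's actual output format or evaluation code.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def parse_box(generated: str) -> Box:
    """Parse a bounding box from a generated coordinate string.

    Assumes a hypothetical whitespace-separated format,
    e.g. '12.0 34.5 200.0 180.0' -> (12.0, 34.5, 200.0, 180.0).
    """
    x1, y1, x2, y2 = (float(t) for t in generated.split())
    return x1, y1, x2, y2

def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

if __name__ == "__main__":
    pred = parse_box("12.0 34.5 200.0 180.0")  # hypothetical model output
    gold = (10.0, 30.0, 210.0, 175.0)          # hypothetical ground-truth box
    print(f"IoU: {iou(pred, gold):.3f}")
```

A leaderboard score such as 75.658 corresponds to this per-example IoU averaged over the test set and scaled to a percentage.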