LLM-Driven Data Augmentation for Visual Question Answering

Published: 2025 · Last Modified: 10 Nov 2025 · JURSE 2025 · CC BY-SA 4.0
Abstract: Remote Sensing Visual Question Answering (RSVQA) is a task aiming at automatically answering questions related to overhead imagery. Many studies have been conducted in recent years, focusing on both methods and data. However, a recurrent problem is the lack of generalization ability and of robustness to questions with similar semantics but different wording. This work focuses on the data, specifically the questions. Our objective is to make RSVQA models more robust to variations in questions, more generalizable (e.g. to unseen phrasing or synonyms), and less susceptible to bias in the data. To this end, we propose to leverage the natural language processing abilities of Large Language Models (LLMs) to enrich an RSVQA dataset by generating new questions with the same meaning and semantics. To showcase the effectiveness of this process, we compare a baseline relying on back translation with the proposed LLM-based approach on an urban dataset (RSVQA-HR). Our experimental study, with quantitative evaluation, highlights that models trained with the proposed data augmentation scheme are indeed more robust to unseen questions.
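The abstract does not detail the augmentation pipeline, but the core idea, prompting an LLM to rephrase each question while preserving its meaning and leaving the paired answer untouched, can be sketched as below. The client library, model name, prompt wording, and number of paraphrases per question are illustrative assumptions, not the authors' actual setup.

```python
# Minimal sketch of LLM-based question paraphrasing for RSVQA data augmentation.
# Assumptions: the OpenAI Python client is available and OPENAI_API_KEY is set;
# the model name and prompt are placeholders, not the paper's configuration.
from openai import OpenAI

client = OpenAI()

def paraphrase_question(question: str, n_variants: int = 3) -> list[str]:
    """Generate meaning-preserving rewordings of one RSVQA question."""
    prompt = (
        "Rephrase the following question about an aerial image. "
        f"Give {n_variants} distinct rewordings, one per line, "
        "keeping the meaning identical:\n"
        f"{question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",          # placeholder model
        messages=[{"role": "user", "content": prompt}],
        temperature=0.9,              # encourage lexical diversity
    )
    text = response.choices[0].message.content or ""
    return [line.strip() for line in text.splitlines() if line.strip()]

def augment(sample: dict) -> list[dict]:
    """Expand one (question, answer) pair into several pairs; the answer is unchanged."""
    variants = paraphrase_question(sample["question"])
    return [{"question": q, "answer": sample["answer"]} for q in [sample["question"], *variants]]

if __name__ == "__main__":
    example = {"question": "Is there a residential building in the image?", "answer": "yes"}
    for pair in augment(example):
        print(pair)
```

A back-translation baseline, as mentioned in the abstract, would follow the same augmentation loop but obtain the rewordings by translating each question to a pivot language and back instead of prompting an LLM.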