Keywords: Robust VQA, Language Biases, Visual Question Answering.
TL;DR: We propose a novel training strategy called Loss Rebalancing Label and Global Context (LRLGC) to mitigate language priors in visual question answering.
Abstract: Despite advances in Visual Question Answering (VQA), many VQA models still suffer from language priors (i.e., they generate answers directly from the question without using the image), which severely reduces their robustness in real-world scenarios. We propose a novel training strategy called Loss Rebalancing Label and Global Context (LRLGC) to alleviate this problem. Specifically, the Loss Rebalancing Label (LRL) is constructed dynamically from each sample's degree of bias, so that per-sample losses are adjusted accurately and the total VQA loss is better balanced. In addition, the Global Context (GC) supplies the model with useful global information that helps it predict answers more accurately. Finally, the model is trained with an ensemble-based approach that retains the beneficial effects of biased samples while reducing their importance. Our approach is model-agnostic and supports end-to-end training. Extensive experiments show that LRLGC (1) improves the performance of various VQA models and (2) performs competitively on the VQA-CP v2 benchmark.
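To make the idea of adjusting losses by sample bias concrete, the following is a minimal sketch of a generic bias-weighted cross-entropy. It is not the LRL formulation itself, which the abstract does not specify. It assumes that a sample's degree of bias can be approximated by the probability a question-only branch assigns to the ground-truth answer; the function name `bias_weighted_loss` and the weighting rule `1 - bias_degree` are hypothetical choices made for illustration.

```python
import math


def bias_weighted_loss(model_probs, bias_probs, targets):
    """Mean cross-entropy with each sample scaled by (1 - bias degree).

    model_probs: per-sample answer distributions from the full VQA model.
    bias_probs:  per-sample answer distributions from a question-only branch
                 (assumption: its confidence in the true answer measures bias).
    targets:     index of the ground-truth answer for each sample.
    """
    total = 0.0
    for probs, q_only, t in zip(model_probs, bias_probs, targets):
        bias_degree = q_only[t]      # high when the question alone gives the answer
        weight = 1.0 - bias_degree   # hypothetical rule: down-weight biased samples
        total += -weight * math.log(probs[t])
    return total / len(targets)


# Two samples with identical model predictions but different bias degrees.
model_probs = [[0.5, 0.5], [0.5, 0.5]]
bias_probs = [[0.9, 0.1], [0.1, 0.9]]
targets = [0, 0]
loss = bias_weighted_loss(model_probs, bias_probs, targets)
```

In this sketch the first sample, which the question-only branch already answers confidently, contributes far less to the total than the second. It still contributes a nonzero amount, in keeping with the stated goal of reducing the importance of biased samples without discarding them.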