Robust visual question answering via semantic cross modal augmentation

Published: 01 Jan 2024, Last Modified: 11 Jan 2025. Venue: Comput. Vis. Image Underst. 2024. License: CC BY-SA 4.0
Abstract: Highlights
• VQA models often confidently give incorrect answers to irrelevant questions.
• We enhance model robustness at test time through multi-modal semantic augmentation.
• The proposed CMA creates varied inputs for models and merges predictions for stability.
• CMA variants improve VQA reliability and performance in ambiguous environments.
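
The highlights describe creating varied inputs at test time and merging the resulting predictions. A minimal sketch of that general idea follows, under stated assumptions: the abstract does not say how CMA builds its variants or merges outputs, so the plain averaging used here, and the names `merge_predictions`, `augmented_predict`, and `toy_model`, are hypothetical illustrations rather than the authors' method.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence

AnswerProbs = Dict[str, float]


def merge_predictions(distributions: Sequence[AnswerProbs]) -> AnswerProbs:
    """Average answer probabilities over all augmented inputs.

    Assumption: uniform averaging. The source does not specify the merge rule.
    """
    if not distributions:
        raise ValueError("need at least one distribution")
    merged: AnswerProbs = {}
    for dist in distributions:
        for answer, prob in dist.items():
            merged[answer] = merged.get(answer, 0.0) + prob / len(distributions)
    return merged


def augmented_predict(
    model: Callable[[str, str], AnswerProbs],
    image: str,
    question: str,
    image_augs: List[Callable[[str], str]],
    question_augs: List[Callable[[str], str]],
) -> AnswerProbs:
    """Run the model on every (image variant, question variant) pair and merge."""
    # The original input is always kept alongside its augmented variants.
    image_variants = [image] + [aug(image) for aug in image_augs]
    question_variants = [question] + [aug(question) for aug in question_augs]
    outputs = [model(i, q) for i, q in product(image_variants, question_variants)]
    return merge_predictions(outputs)


def toy_model(image: str, question: str) -> AnswerProbs:
    """Hypothetical stand-in for a VQA model that is sensitive to wording."""
    if "colour" in question:
        return {"red": 0.4, "blue": 0.6}
    return {"red": 0.8, "blue": 0.2}


merged = augmented_predict(
    toy_model,
    image="car.png",  # placeholder; a real system would pass pixel data
    question="What color is the car?",
    image_augs=[str.upper],  # stands in for a visual transform
    question_augs=[lambda q: q.replace("color", "colour")],  # stands in for a paraphrase
)
best_answer = max(merged, key=merged.get)
```

Averaging over variants dampens the effect of any single wording or view on the final answer, which is one plausible reading of "merges predictions for stability"; a weighted or voting-based merge would fit the same interface.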