Highlights

• Proposal of a generic Linguistically-aware Attention (LAT) mechanism to reduce the semantic gap between modalities in vision-language tasks.
• Proposal of a novel Counting-VQA model that achieves state-of-the-art results on five counting-specific VQA datasets.
• Adaptation of LAT into several state-of-the-art VQA models, namely UpDn, MUREL, and BAN; LAT improves the performance of all of them.
• Adaptation of LAT into the best-performing object-level attention-based captioning model (UpDn); incorporating LAT improves the captioning performance of the baseline model.