Highlights

• Proposal of a generic Linguistically-aware Attention (LAT) mechanism to reduce the semantic gap between modalities in vision-language tasks.
• Proposal of a novel Counting-VQA model that achieves state-of-the-art results on five counting-specific VQA datasets.
• Adaptation of LAT into several state-of-the-art VQA models, namely UpDn, MUREL, and BAN; LAT improves the performance of all of them.
• Adaptation of LAT into the best-performing object-level attention-based captioning model (UpDn); incorporating LAT improves the captioning performance of the baseline model.