Abstract: Visual Question Answering (VQA) models have so far struggled to count objects in natural images. We identify a fundamental problem caused by soft attention in these models as one reason for this. To circumvent the problem, we propose a neural network component that allows robust counting from object proposals. Experiments on a toy task show the effectiveness of this component. On the VQA v2 dataset, we obtain state-of-the-art accuracy on the number category without negatively affecting other categories, with our single model even outperforming ensemble models. On a difficult balanced pair metric, the component improves counting over a strong baseline by a substantial 6.6%.
TL;DR: Enabling Visual Question Answering models to count by handling overlapping object proposals.
Keywords: visual question answering, vqa, counting
Code: [Cyanogenoid/vqa-counting](https://github.com/Cyanogenoid/vqa-counting)
Data: [CLEVR](https://paperswithcode.com/dataset/clevr), [Visual Question Answering](https://paperswithcode.com/dataset/visual-question-answering), [Visual Question Answering v2.0](https://paperswithcode.com/dataset/visual-question-answering-v2-0)
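
Illustration: the abstract attributes the counting difficulty to soft attention. A minimal sketch of one way to see this, assuming softmax-normalized attention over object proposal features (the function name `soft_attention`, the feature vector, and the scores are hypothetical and chosen only for illustration, not taken from the linked code): because the attention weights sum to one, the attended feature for two identical objects equals the attended feature for a single object, so the count is no longer recoverable from it.

```python
import numpy as np

def soft_attention(features, scores):
    """Softmax-normalize the scores and return the weighted average of the features.

    Illustrative helper only; not the API of the vqa-counting repository.
    """
    weights = np.exp(scores - scores.max())  # shift by the max for numerical stability
    weights /= weights.sum()                 # weights now sum to 1
    return weights @ features

# Hypothetical feature vector for one detected object (e.g. a cat).
cat = np.array([1.0, 0.5, -2.0])

# Image A holds one cat proposal; image B holds two identical cat proposals.
one_cat = soft_attention(np.stack([cat]), np.array([3.0]))
two_cats = soft_attention(np.stack([cat, cat]), np.array([3.0, 3.0]))

# The attended features coincide, so a downstream classifier
# cannot distinguish "one" from "two" based on them.
same = np.allclose(one_cat, two_cats)
print(same)
```

This only illustrates the averaging effect under the stated assumption; the proposed component itself, which counts from overlapping object proposals, is not reproduced here.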