Learning to Count Objects in Natural Images for Visual Question Answering

Yan Zhang; Jonathon Hare; Adam Prügel-Bennett

15 Feb 2018 (modified: 22 Jun 2025), ICLR 2018 Conference Blind Submission, Readers: Everyone

Abstract: Visual Question Answering (VQA) models have so far struggled to count objects in natural images. We identify a fundamental problem caused by soft attention in these models as one reason. To circumvent this problem, we propose a neural network component that allows robust counting from object proposals. Experiments on a toy task show the effectiveness of this component, and we obtain state-of-the-art accuracy on the number category of the VQA v2 dataset without negatively affecting other categories, even outperforming ensemble models with our single model. On a difficult balanced pair metric, the component improves counting over a strong baseline by a substantial 6.6%.
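The abstract does not spell out the soft-attention problem. One common reading, assumed here, is that attention weights normalized to sum to 1 turn the attended feature into a weighted average, so the number of matching objects is lost. The minimal sketch below illustrates that effect only; it is not the authors' component, and the names `softmax`, `attend`, and `CAT` are hypothetical.

```python
import math

def softmax(scores):
    """Normalize raw attention scores so the weights sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features, scores):
    """Soft attention: weighted average of per-object feature vectors."""
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]

# Hypothetical 2-d feature vector standing in for "a cat" (assumption).
CAT = [1.0, 0.0]

# One cat in the image versus two identical cats with equal attention scores.
one_cat = attend([CAT], [2.0])
two_cats = attend([CAT, CAT], [2.0, 2.0])

# Both attended features are the same, so a downstream classifier
# cannot tell the two images apart from this vector alone.
print(one_cat, two_cats)
```

Under this assumption, any count has to be recovered from information other than the normalized weighted sum, which is consistent with the abstract's choice to count from object proposals instead.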

TL;DR: Enabling Visual Question Answering models to count by handling overlapping object proposals.

Keywords: visual question answering, vqa, counting

Code: [Cyanogenoid/vqa-counting](https://github.com/Cyanogenoid/vqa-counting)

Data: [CLEVR](https://paperswithcode.com/dataset/clevr), [Visual Question Answering](https://paperswithcode.com/dataset/visual-question-answering), [Visual Question Answering v2.0](https://paperswithcode.com/dataset/visual-question-answering-v2-0)

Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/learning-to-count-objects-in-natural-images/code)
