ObjectNet Captions: Models are not superhuman captioners

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Supplementary Material: zip
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Deep learning, Representation learning, Computer vision, Datasets
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Even on out-of-domain image captioning datasets such as nocaps, models often outperform humans according to captioning metrics like CIDEr. Yet, under real-world conditions, model captions are often wrong. We demonstrate this performance deficit by introducing a new dataset and a new captioning metric. The dataset, ObjectNet Captions, reduces the spurious correlations that machines often exploit. We expose the shortcomings of current captioning metrics with a head-to-head experiment against humans, in which humans rate human-generated captions as being of much higher quality than machine captions. Motivated by this, we introduce HUMANr, a robust, consistent, and easy-to-replicate metric computed from head-to-head comparisons, which can be crowdsourced at low cost; we also develop tooling to compute HUMANr automatically. HUMANr is an absolute performance metric: driving it to 0 means that humans can no longer distinguish machine captions from human captions. No current metric provides such a fixed target to aim for, together with a clear signal of when captioning is solved in this sense. Moreover, HUMANr can reveal that humans still outperform machines, which no current metric is able to demonstrate. Existing metrics both overstate the performance of machine models and inherently limit it. While most current metrics are saturated, HUMANr provides significant opportunities for further captioning research, thereby opening the door to new advances. ObjectNet Captions and HUMANr are made available to the research community.
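The abstract does not give HUMANr's exact formula, so the following is an illustration only: a minimal Python sketch of how an absolute head-to-head metric with the stated fixed-target-at-0 property might be computed, assuming each crowdsourced comparison records whether a rater preferred the human caption, the machine caption, or neither. The scoring convention (+1 human preferred, -1 machine preferred, 0 tie) and all names here are our assumptions, not the paper's implementation; the real metric may add normalization or significance testing.

from dataclasses import dataclass

@dataclass
class Judgment:
    # Hypothetical record of one crowdsourced head-to-head comparison.
    preferred: str  # "human", "machine", or "tie"

def humanr_style_score(judgments):
    """Illustrative head-to-head score in [-1, 1].
    +1: raters always prefer human captions;
    -1: raters always prefer machine captions;
     0: raters cannot tell the two sources apart (captioning "solved")."""
    values = {"human": 1.0, "machine": -1.0, "tie": 0.0}
    return sum(values[j.preferred] for j in judgments) / len(judgments)

# Example: raters mostly prefer human captions, so the score sits above 0.
judgments = [Judgment("human"), Judgment("human"), Judgment("tie"), Judgment("machine")]
print(humanr_style_score(judgments))  # 0.25

Under this convention, a model whose captions are indistinguishable from human ones drives the score to 0, giving the fixed target the abstract describes.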
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4705