- Abstract: Recurrent neural networks can be trained to describe images with natural language, but it has been observed that they generalize poorly to new scenes at test time. Here we provide an experimental framework to quantify their generalization to unseen compositions. By describing images using short structured representations, we tease apart and evaluate separately two types of generalization: (1) generalization to new images of similar scenes, and (2) generalization to unseen compositions of known entities. We quantify these two types of generalization by a large-scale experiment on the MS-COCO dataset with a state-of-the-art recurrent network, and compare to a baseline structured prediction model on top of a deep network. We find that a state-of-the-art image captioning approach is largely "blind" to new combinations of known entities (~2.3% precision@1), and achieves statistically similar precision@1 to that of a considerably simpler structured-prediction model with much smaller capacity. We therefore advocate using compositional generalization metrics to evaluate vision and language models, since generalizing to new combinations of known entities is key for understanding complex real data.
- TL;DR: For image captioning, we propose an experimental framework to evaluate generalization to unseen compositions of known entities, showing that state-of-the-art captioning approaches generalize very poorly to new compositions.
- Keywords: Computer vision, Natural language processing, Deep learning, Supervised Learning, Transfer Learning, Multi-modal learning, Structured prediction
- Conflicts: google.com, tau.ac.il, biu.ac.il