What are image captions made of?


Nov 07, 2017 (modified: Nov 07, 2017) ICLR 2018 Conference Blind Submission readers: everyone
  • Abstract: This paper focuses on the 'image' side of image captioning. We investigate why end-to-end neural image captioning systems seemingly work so well, and how they can be improved by making better-informed use of different image representations. We study the properties of different types of image representations and how they affect the performance of end-to-end image captioning models. Our empirical analysis provides insights into these representational properties and suggests that the model implicitly learns a 'visual-semantic' sub-space. We also provide insights into the generalization capabilities of the model. Our analysis focuses specifically on interpreting the discriminative quality of the feature representations, some properties of the induced space, and the uniqueness of generated image captions. Our results suggest that explicitly modeling the presence of objects and basic object interactions is necessary for tasks that require semantic understanding and better generalization.
  • TL;DR: This paper presents an empirical analysis of the role of different types of image representations and probes the properties of these representations for the task of image captioning.
  • Keywords: image captioning, representation learning, interpretability, rnn, multimodal, vision to language