Abstract: Contextualized image captioning goes beyond generating a purely visual description of image content and aims to produce a caption that is influenced by context and informed by real-world knowledge. In this paper, we present an approach to knowledge-aware image captioning, with a specific focus on the temporal domain. We propose a way to identify relevant information in external data sources, such as geographic databases and common knowledge bases, and to encode it in a form that is most useful for the captioning network. We develop an end-to-end caption generation system that incorporates external knowledge into the captioning process at several stages. The system is trained and tested on our novel temporal knowledge-aware captioning dataset, achieving significant improvements over multiple baselines across standard metrics. We demonstrate that our approach is effective for generating highly contextualized captions with temporal facts that are both relevant and accurate.