On diversity in image captioning: metrics and methods

Qingzhong Wang, Jia Wan, Antoni B. Chan

26 Apr 2021 | OpenReview Archive Direct Upload | Readers: Everyone

Abstract: In this paper, we first propose a metric to measure the diversity of a set of captions, which is derived from latent semantic analysis (LSA), and then kernelize LSA using CIDEr similarity. Compared with mBLEU, our proposed diversity metrics show a relatively strong correlation with human evaluation. We conduct extensive experiments and find that models that aim to generate captions with higher CIDEr scores normally obtain lower diversity scores; such models generally learn to describe images using common words. To bridge this “diversity” gap, we consider several methods for training caption models to generate diverse captions. First, we show that balancing the cross-entropy loss and CIDEr reward in reinforcement learning during training can effectively control the tradeoff between diversity and accuracy. Second, we develop approaches that directly optimize our diversity metric and CIDEr score using reinforcement learning. Third, we combine accuracy and diversity into a single measure using an ensemble matrix and then maximize the determinant of the ensemble matrix via reinforcement learning to boost diversity and accuracy, which outperforms its counterparts on the oracle test. Finally, we develop a DPP selection algorithm to select a subset of captions from a large number of candidate captions.
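
The abstract does not give the formula for the diversity metric, so the following is only a minimal sketch of how an LSA-derived score for a caption set might look. It assumes whitespace tokenization, raw word counts, and a score of the form -log(s_max / sum of s_i) / log(m), where the s_i are the singular values of the word-by-caption matrix and m is the number of captions. The function names `lsa_diversity` and `kernel_diversity` are hypothetical, and the kernel variant takes any symmetric positive semi-definite similarity matrix; a matrix of pairwise CIDEr similarities would have to be supplied by the caller and is not computed here.

```python
import math
from collections import Counter

import numpy as np


def kernel_diversity(K):
    """Diversity from a symmetric PSD similarity matrix K (m x m).

    Assumed form: square roots of the eigenvalues of K play the role of
    singular values, and the score is -log(s_max / sum(s)) / log(m).
    """
    K = np.asarray(K, dtype=float)
    m = K.shape[0]
    if m < 2:
        raise ValueError("need at least two captions")
    # Clip tiny negative eigenvalues caused by floating-point error.
    eig = np.clip(np.linalg.eigvalsh(K), 0.0, None)
    s = np.sqrt(eig)
    return float(-math.log(s.max() / s.sum()) / math.log(m))


def lsa_diversity(captions):
    """Diversity of a caption set from the SVD of its word-count matrix."""
    m = len(captions)
    if m < 2:
        raise ValueError("need at least two captions")
    tokenized = [c.lower().split() for c in captions]
    vocab = sorted({w for toks in tokenized for w in toks})
    if not vocab:
        raise ValueError("captions must contain at least one word")
    index = {w: i for i, w in enumerate(vocab)}
    # Word-by-caption count matrix (vocabulary size x m).
    counts = np.zeros((len(vocab), m))
    for j, toks in enumerate(tokenized):
        for w, n in Counter(toks).items():
            counts[index[w], j] = n
    s = np.linalg.svd(counts, compute_uv=False)
    return float(-math.log(s[0] / s.sum()) / math.log(m))


if __name__ == "__main__":
    same = ["a dog runs on grass"] * 3
    varied = ["a dog runs", "the cat sleeps", "two birds fly"]
    print(lsa_diversity(same), lsa_diversity(varied))
```

Under these assumptions the score is close to 0 when every caption is identical (the count matrix has rank one) and reaches 1 when the captions share no words and have equal length. With the linear kernel, that is K equal to the Gram matrix of the count columns, `kernel_diversity` gives the same value as `lsa_diversity`, which is the sense in which a different similarity such as CIDEr could be swapped in.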
