Dynamic window sampling strategy for image captioning

Published: 01 Jan 2025 · Last Modified: 11 Apr 2025 · Eng. Appl. Artif. Intell. 2025 · CC BY-SA 4.0
Abstract: Image captioning aims to transform the visual information in images into semantically accurate and grammatically correct textual descriptions. In this paper, we focus on improving the word sampling process to enhance the training effectiveness of image captioning models. We first demonstrate that sampling sentences with beam search during reinforcement learning training performs worse than probability-based sampling. We then find that probability-based sampling is prone to selecting inaccurate words because of an unstable probability distribution. Because mainstream reinforcement learning training relies on sentence-level feedback, word-level effects are difficult to perceive, so irrelevant words can interfere with learning the entire sentence. To address this issue, we propose a dynamic window sampling strategy for image captioning. The core idea is to dynamically determine the word candidate pool, i.e., the sampling window, based on overall prediction confidence. Compared to sampling words over the entire vocabulary, our approach maintains diversity while avoiding the sampling of irrelevant words as much as possible. Extensive experiments on benchmark datasets show that the proposed dynamic window sampling strategy significantly improves model performance. Specifically, our method achieves CIDEr scores of 144.9% (single model) and 147.9% (ensemble of 4 models) on the offline test, and 142.1% (c5) and 144.0% (c40) on the official online test server. Overall, this paper contributes to artificial intelligence research by studying sampling strategies to improve image captioning performance. The source code is available at https://github.com/792218/DWSS.
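The core idea above can be illustrated with a minimal sketch. The snippet below is a hypothetical rendering of confidence-based window sampling, not the paper's exact rule: it keeps the smallest set of highest-probability tokens whose cumulative probability reaches a threshold `p` (the "window"), then samples only within that window instead of over the full vocabulary. The function name, the threshold `p`, and the cumulative-probability criterion are illustrative assumptions; see the linked repository for the authors' actual implementation.

```python
import numpy as np

def dynamic_window_sample(logits, p=0.9, rng=None):
    """Sample a token id from a confidence-based window.

    Hypothetical sketch: the window is the smallest set of
    highest-probability tokens whose cumulative probability
    reaches `p`. A peaked (confident) distribution yields a
    small window; a flat (uncertain) one yields a larger window,
    preserving diversity while excluding low-probability words.
    """
    rng = rng or np.random.default_rng()
    # Softmax with max-subtraction for numerical stability.
    probs = np.exp(logits - np.max(logits))
    probs /= probs.sum()
    # Sort tokens by descending probability.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    # Smallest prefix whose cumulative mass reaches p.
    cutoff = int(np.searchsorted(cum, p)) + 1
    window = order[:cutoff]
    # Renormalize within the window and sample.
    w = probs[window] / probs[window].sum()
    return int(rng.choice(window, p=w))
```

For a confident prediction such as `logits = [10.0, 0.0, 0.0, 0.0]`, the window collapses to the single top token, so the sampler cannot pick an irrelevant word; for near-uniform logits, the window widens and sampling stays diverse.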