SeaCap: Multi-Sight Embedding and Alignment for One-Stage Image Captioner

Published: 01 Jan 2025 · Last Modified: 01 Aug 2025 · IEEE Transactions on Multimedia, 2025 · License: CC BY-SA 4.0
Abstract: Recent mainstream image captioning methods usually adopt two-stage captioners, i.e., they compute object features of the given image with a pre-trained detector and then feed them into a language model to generate descriptive sentences. However, such a two-stage procedure leads to a task-based information gap that degrades captioning performance, because object features learned for the detection task are suboptimal representations and cannot provide all the information needed for subsequent sentence generation. Moreover, the object features are usually taken from the detector's last pooling layer, which loses the local details of images. In this paper, we propose a novel One-Stage Image Captioner with dynamic multi-sight embedding and alignment, called SeaCap, which directly transforms input images into descriptive sentences in a single stage to eliminate this information gap. Specifically, to obtain rich features, we use the Swin Transformer to capture multi-level features, pass them through a sights alignment module to alleviate visual confusion, and then feed them into a novel dynamic multi-sight embedding module that exploits both the global structure and the local texture of input images. To enhance the global modeling capacity of the visual encoder, we propose a new dual-dimensional refining module that non-locally models the interactions among the embedded features. As a result, SeaCap obtains rich and useful information that improves captioning performance. Extensive comparisons on the benchmark MS-COCO, Flickr8K, and Flickr30K datasets verify the superior performance of our method.
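
The sketch below illustrates, in PyTorch, the overall one-stage pipeline the abstract describes: multi-level visual features are aligned into a shared space, dynamically merged across "sights", refined non-locally, and decoded into a caption. All module names and internals here (ToyMultiScaleBackbone, SightsAlignment, MultiSightEmbedding, DualDimensionalRefining) are assumptions inferred from the abstract alone, not the authors' implementation; a toy multi-scale CNN stands in for the Swin Transformer backbone.

```python
# Minimal one-stage captioner sketch; module designs are assumptions based on
# the abstract, not the SeaCap release. A toy CNN replaces the Swin backbone.
import torch
import torch.nn as nn


class ToyMultiScaleBackbone(nn.Module):
    """Stand-in for Swin: returns feature maps at three scales."""
    def __init__(self, dims=(96, 192, 384)):
        super().__init__()
        self.stages, in_ch = nn.ModuleList(), 3
        for d in dims:
            self.stages.append(nn.Sequential(
                nn.Conv2d(in_ch, d, 3, stride=2, padding=1), nn.GELU()))
            in_ch = d

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # list of (B, C_i, H_i, W_i)


class SightsAlignment(nn.Module):
    """Project every scale ('sight') into one shared embedding space."""
    def __init__(self, dims, embed_dim):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(d, embed_dim, 1) for d in dims)

    def forward(self, feats):
        # Each sight becomes a token sequence of shape (B, H_i*W_i, D).
        return [p(f).flatten(2).transpose(1, 2) for f, p in zip(feats, self.proj)]


class MultiSightEmbedding(nn.Module):
    """Dynamically weight and merge coarse (global) and fine (local) sights."""
    def __init__(self, embed_dim, num_sights):
        super().__init__()
        self.gate = nn.Linear(embed_dim, num_sights)

    def forward(self, tokens):
        pooled = torch.stack([t.mean(1) for t in tokens], dim=1)   # (B, S, D)
        weights = self.gate(pooled.mean(1)).softmax(-1)            # (B, S)
        merged = torch.cat(
            [w.view(-1, 1, 1) * t for w, t in zip(weights.unbind(1), tokens)], 1)
        return merged                                              # (B, N, D)


class DualDimensionalRefining(nn.Module):
    """Non-local interaction along the token and channel dimensions."""
    def __init__(self, embed_dim, heads=4):
        super().__init__()
        self.token_attn = nn.MultiheadAttention(embed_dim, heads, batch_first=True)
        self.channel_mix = nn.Sequential(
            nn.Linear(embed_dim, embed_dim * 2), nn.GELU(),
            nn.Linear(embed_dim * 2, embed_dim))

    def forward(self, x):
        x = x + self.token_attn(x, x, x, need_weights=False)[0]
        return x + self.channel_mix(x)


class OneStageCaptioner(nn.Module):
    """Image -> caption logits in a single forward pass (no detector stage)."""
    def __init__(self, vocab_size=10000, embed_dim=256, dims=(96, 192, 384)):
        super().__init__()
        self.backbone = ToyMultiScaleBackbone(dims)
        self.align = SightsAlignment(dims, embed_dim)
        self.embed = MultiSightEmbedding(embed_dim, len(dims))
        self.refine = DualDimensionalRefining(embed_dim)
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        layer = nn.TransformerDecoderLayer(embed_dim, 4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(embed_dim, vocab_size)

    def forward(self, images, captions):
        visual = self.refine(self.embed(self.align(self.backbone(images))))
        T = captions.size(1)  # causal mask so each word only sees its prefix
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        out = self.decoder(self.word_embed(captions), visual, tgt_mask=mask)
        return self.head(out)                                      # (B, T, V)


if __name__ == "__main__":
    model = OneStageCaptioner()
    logits = model(torch.randn(2, 3, 64, 64), torch.randint(0, 10000, (2, 12)))
    print(logits.shape)  # torch.Size([2, 12, 10000])
```

Because the visual encoder and language decoder are trained jointly end to end, the visual tokens are optimized for caption generation rather than for a detection objective, which is the information gap the one-stage design is meant to close.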