Learning cross-modality features for image caption generation

Published: 01 Jan 2022, Last Modified: 05 Aug 2025 · Int. J. Mach. Learn. Cybern. 2022 · CC BY-SA 4.0
Abstract: Image captioning is a challenging task at the intersection of vision and language research. In a typical deep learning-based image captioning model, two types of input features are used to generate the token at the current inference step: the attended visual feature and the embedding of the previous word. However, this standard pipeline ignores sentence-level embeddings. In this paper, we propose Intrinsic Cross-Modality Captioning (ICMC), a new method that improves image captioning with sentence-level embeddings and cross-modality alignment. The novelty of our model lies mainly in the text encoder and the cross-modality module. In the feature encoding stage, an adaptation module maps the global visual features into the joint domain. In the decoding stage, the adapted features then guide the visual attention process over the RCNN features. With the proposed method, we not only attend to the visual features and the previous word when generating captions but also incorporate sentence-level cues from the ground-truth captions during the training phase. Evaluation on the MSCOCO benchmark and extensive ablation studies validate the effectiveness of the proposed method.
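The abstract describes an adaptation module that maps a global visual feature into a joint vision-language domain, whose output then guides attention over RCNN region features during decoding. A minimal NumPy sketch of that two-step idea is shown below; all function names, weight shapes, and the linear-map form of the adaptation module are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over attention scores.
    e = np.exp(x - x.max())
    return e / e.sum()

def adapt(global_feat, W_a):
    # Hypothetical adaptation module: a linear map (plus tanh) taking the
    # global visual feature into the joint vision-language domain.
    return np.tanh(W_a @ global_feat)

def guided_attention(region_feats, adapted, W_q):
    # Attend over RCNN region features, using the adapted global feature
    # as the query that guides the visual attention process.
    query = W_q @ adapted
    scores = region_feats @ query      # one score per region
    weights = softmax(scores)          # attention distribution over regions
    return weights @ region_feats      # attended visual context vector

rng = np.random.default_rng(0)
regions = rng.standard_normal((36, 128))        # 36 RCNN regions, 128-d each (assumed sizes)
g = rng.standard_normal(512)                    # global visual feature (assumed 512-d)
W_a = rng.standard_normal((128, 512)) * 0.01    # adaptation weights
W_q = rng.standard_normal((128, 128)) * 0.1     # query projection weights

ctx = guided_attention(regions, adapt(g, W_a), W_q)
print(ctx.shape)
```

In the full model this context vector would feed the caption decoder alongside the previous word embedding; the sentence-level alignment loss on ground-truth captions is applied only at training time and is omitted here.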