Multimodal learning with feature fusion transformer for image captioning

Published: 01 Jan 2025, Last Modified: 16 Nov 2025 · Displays 2025 · CC BY-SA 4.0
Abstract: Most existing methods fuse multimodal features to produce image descriptions, but they cannot fully explore correlations across modalities. To improve accuracy, we propose a Multimodal learning method with a Feature Fusion Transformer (MFFT), which facilitates feature fusion within the same modality and enhances feature alignment between different modalities. To strengthen the cross-modal analysis of different features, we introduce cross memory attention mechanisms that fully exchange visual and textual information. By designing learnable memory vectors in the encoders and decoders, we efficiently align features with each other and alleviate the representation differences across modalities. Specifically, we propose a Cross Memory Encoding block (CME) that employs trainable memory vectors to merge region and grid features, and a Cross Memory Decoding block (CMD) that utilizes mixed cross-attention and trainable global memory vectors to learn the prior distribution of linguistic features from texts. Additionally, we propose a novel pre-training strategy that aligns multiple features from different modalities to mitigate the intrinsic differences between images and texts, improving the cross-modal representation capability of both encoders and decoders. Results of extensive experiments on the MSCOCO dataset highlight that our proposed MFFT yields superior performance compared to several state-of-the-art methods.
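The abstract only summarizes the CME/CMD blocks, so the sketch below is merely a minimal illustration of the general idea of attending over learnable memory vectors alongside input features (here, fusing region and grid features). All names, dimensions, the memory count, and the use of `torch.nn.MultiheadAttention` are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MemoryAugmentedAttention(nn.Module):
    """Multi-head attention whose keys/values are extended with learnable
    memory slots, so queries can also attend to trainable prior vectors
    in addition to the input features (hypothetical sketch)."""

    def __init__(self, d_model=512, num_heads=8, num_memory=40):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # Learnable memory keys/values, shared across the batch.
        self.mem_k = nn.Parameter(torch.randn(1, num_memory, d_model) * 0.02)
        self.mem_v = nn.Parameter(torch.randn(1, num_memory, d_model) * 0.02)

    def forward(self, query, context):
        # query:   (B, Nq, d_model), e.g. region features or word embeddings
        # context: (B, Nk, d_model), e.g. grid features or encoder outputs
        b = query.size(0)
        k = torch.cat([context, self.mem_k.expand(b, -1, -1)], dim=1)
        v = torch.cat([context, self.mem_v.expand(b, -1, -1)], dim=1)
        out, _ = self.attn(query, k, v)
        return out


if __name__ == "__main__":
    regions = torch.randn(2, 50, 512)  # stand-in region features
    grids = torch.randn(2, 49, 512)    # stand-in grid features
    fused = MemoryAugmentedAttention()(regions, grids)
    print(fused.shape)  # torch.Size([2, 50, 512])
```

In this reading, the memory slots act as a learned prior that every query can attend to regardless of the input, which is one plausible way to bridge representation gaps between modalities as described above.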