Distilled Cross-Combination Transformer for Image Captioning with Dual Refined Visual Features

Published: 20 Jul 2024 · Last Modified: 21 Jul 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Transformer-based encoders that jointly encode region and grid features are the preferred choice for image captioning, because their multi-head self-attention mechanism captures relationships and contextual information among the various regions of an image especially well. However, stacking Transformer blocks means self-attention recomputes the visual features many times, which increases computational cost and produces a great deal of redundant feature computation. In this paper, we propose a novel Distilled Cross-Combination Transformer (DCCT) network. Specifically, we first design a Distillation Cascade Fusion Encoder (DCFE) that filters out the redundant visual features which distract attention, yielding refined features. We then introduce a Parallel Cross-Fusion Attention (PCFA) module that fully exploits the complementarity and correlation between grid and region features to better fuse the two encoded visual streams. Extensive experiments on the MS COCO dataset demonstrate that the proposed DCCT outperforms many state-of-the-art methods and achieves excellent performance.
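The abstract does not give implementation details, but the PCFA idea of fusing two visual streams via attention in both directions can be illustrated with a minimal single-head sketch. This is not the authors' implementation; the dimensions (49 grid cells, 36 regions, d = 64), the mean-pooling, and the concatenation fusion are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, kv_feats, Wq, Wk, Wv):
    """Single-head cross-attention: queries come from one visual
    stream, keys/values from the other."""
    Q, K, V = q_feats @ Wq, kv_feats @ Wk, kv_feats @ Wv
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d))  # (n_q, n_kv) attention map
    return attn @ V                        # (n_q, d) attended features

# Hypothetical sizes: a 7x7 grid-feature map and 36 detected regions.
rng = np.random.default_rng(0)
d = 64
grid = rng.standard_normal((49, d))    # grid features (e.g. CNN map)
region = rng.standard_normal((36, d))  # region features (e.g. detector)
W = lambda: rng.standard_normal((d, d)) / np.sqrt(d)  # random projections

# Two parallel cross-attention branches, one per direction, then fuse.
g2r = cross_attention(grid, region, W(), W(), W())  # grid attends to regions
r2g = cross_attention(region, grid, W(), W(), W())  # regions attend to grid
fused = np.concatenate([g2r.mean(axis=0), r2g.mean(axis=0)])  # joint vector
print(fused.shape)  # (128,)
```

The two branches run independently (hence "parallel"), letting each stream borrow complementary context from the other before the fused representation is passed on.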
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: First, we present a Distillation Cascade Fusion Encoder (DCFE), which improves encoding efficiency by filtering redundant features from the image to produce more refined visual representations. Second, we introduce a novel Parallel Cross-Fusion Attention (PCFA) module that fully exploits the complementarity and correlation between the dual visual features to obtain more informative multimodal representations. Finally, extensive experiments on the benchmark MS COCO dataset show that our proposed DCCT outperforms state-of-the-art methods, reaching a score of 144.1 in the ensemble configuration. DCCT thus contributes to multimedia/multimodal processing for image description through its innovations in dual visual feature extraction and fusion, cross-modal information fusion, and distillation-and-cascade fusion.
Submission Number: 2825