Abstract: Recently, great progress has been made in image captioning by improving the Transformer architecture. As a key component, dot-product self-attention updates the representation of each feature vector in the visual encoder and guides the caption decoding process. However, the pairwise interaction in dot-product self-attention restricts attention weights to the instance or local level, making it difficult for the attention module to obtain global feature representations. Furthermore, self-attention is typically implemented in a multi-head fashion in which each attention head is computed independently, so the model cannot exploit the complementary information contained in different heads. In this paper, we propose Hadamard Product Perceptron Attention (HPPA) for image captioning, which introduces a more global feature interaction and incorporates interaction among attention heads when computing the attention results. Feature interaction based on the Hadamard product can integrate multimodal features more effectively than the dot product and provides richer feature representations. HPPA therefore first fuses the input features with the Hadamard product, then generates a set of attention memory vectors containing global interaction features, and finally computes the attention weights dynamically from these vectors. When combined with the multi-head mechanism, HPPA can exploit the complementary information across heads. We further integrate HPPA into the Transformer encoder to obtain a Hadamard Product Perceptron Transformer (HPPT), which serves as a feature-enhancement encoder. Both HPPA and HPPT can be easily plugged into existing attention-based or Transformer-based models. Extensive experiments on the MSCOCO and Flickr30k datasets demonstrate the effectiveness and generalizability of our approach.
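To make the described mechanism concrete, below is a minimal PyTorch sketch of one plausible reading of HPPA. It is an interpretation, not the authors' exact formulation: the fusion of queries with a mean key summary, the number of memory vectors `n_mem`, and all module names (`HPPA`, `mem_proj`, etc.) are illustrative assumptions, and the explicit multi-head splitting is omitted for brevity.

```python
# Hypothetical sketch of Hadamard Product Perceptron Attention (HPPA).
# All names and design choices here are assumptions made for
# illustration; the abstract does not specify the exact formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HPPA(nn.Module):
    def __init__(self, d_model: int = 512, n_mem: int = 64):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Perceptron mapping Hadamard-fused features to scores over a
        # set of shared attention memory vectors (assumed size n_mem).
        self.mem_proj = nn.Linear(d_model, n_mem)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_tokens, d_model) visual feature vectors.
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Hadamard-product fusion: element-wise interaction between each
        # query and a global key summary (assumption: mean over tokens),
        # giving a global interaction instead of pairwise dot products.
        fused = q * k.mean(dim=1, keepdim=True)            # (B, N, D)
        scores = self.mem_proj(fused)                      # (B, N, M)
        # Write step: a softmax over tokens builds memory vectors that
        # aggregate global interaction features from the values.
        write = F.softmax(scores, dim=1)                   # (B, N, M)
        memory = torch.einsum('bnm,bnd->bmd', write, v)    # (B, M, D)
        # Read step: a softmax over memory slots gives each token its
        # final attention weights, computed dynamically from the shared
        # memory vectors.
        read = F.softmax(scores, dim=2)                    # (B, N, M)
        out = torch.einsum('bnm,bmd->bnd', read, memory)   # (B, N, D)
        return self.out_proj(out)

# Usage: enhance a grid of region features from a visual encoder.
feats = torch.randn(2, 49, 512)       # e.g. 7x7 grid features
enhanced = HPPA()(feats)              # same shape as the input
print(enhanced.shape)                 # torch.Size([2, 49, 512])
```

Because the memory vectors in this sketch are shared across all positions rather than partitioned per head, attention weights computed from them would mix information that independent heads keep separate, which is one way the abstract's cross-head complementarity could be realized.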