Cross-modal multi-headed attention for long multimodal conversations

Published: 01 Jan 2023, Last Modified: 04 Mar 2025, Multimedia Tools and Applications 2023, CC BY-SA 4.0
Abstract: Most conversational AI agents in today's marketplace are unimodal: only text is exchanged between the user and the bot. However, employing additional modes (e.g., images) in the interaction improves the customer experience and can increase efficiency and profits in applications such as online shopping. Most existing techniques rely on feature extraction from the multimodal inputs, but very few works have applied multi-headed attention from transformers to conversational AI. In this work, we propose a novel architecture called Cross-modal Multi-headed Hierarchical Encoder-Decoder with Sentence Embeddings (CMHRED-SE) to enhance the quality of natural-language responses by better capturing features such as color, sentence structure, and the continuity of the conversation. CMHRED-SE uses multi-headed attention together with image representations from the VGGNet19 and ResNet50 architectures to improve effectiveness in fashion domain-specific conversations. CMHRED-SE is compared with two similar models, M-HRED and MHRED-attn, and the quality of the answers returned by the models is evaluated using BLEU-4, ROUGE-L, and cosine similarity scores. The evaluation shows an improvement of 5% in cosine similarity, 9% in ROUGE-L F1 score, and 11% in BLEU-4 score over the baseline models. The results also show that our approach better understands the conversation and generates clearer textual responses by leveraging the sentence embeddings.
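
The sketch below illustrates the general idea of cross-modal multi-headed attention described in the abstract: text-side hidden states attend over a grid of CNN image features (e.g., ResNet50 conv-map regions). This is not the authors' implementation; the module name, dimensions, and the use of PyTorch's nn.MultiheadAttention with random placeholder tensors in place of real VGGNet19/ResNet50 features are illustrative assumptions.

```python
# Illustrative sketch (assumed, not the paper's code): cross-modal multi-headed
# attention where text states query image-region features.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, text_dim=512, image_dim=2048, num_heads=8):
        super().__init__()
        # Project ResNet50-style grid features (2048-d) into the text space.
        self.img_proj = nn.Linear(image_dim, text_dim)
        self.attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)

    def forward(self, text_states, image_grid):
        # text_states: (batch, n_tokens, text_dim), e.g. sentence-embedding
        #              enriched encoder states.
        # image_grid:  (batch, n_regions, image_dim), e.g. the 7x7 = 49 spatial
        #              positions of a ResNet50 feature map, flattened.
        img = self.img_proj(image_grid)
        # Queries come from the text, keys/values from the image regions, so
        # different heads can focus on different visual attributes (e.g., color).
        fused, weights = self.attn(query=text_states, key=img, value=img)
        return fused, weights

# Toy usage with random tensors standing in for real VGGNet19/ResNet50 features.
text = torch.randn(2, 10, 512)     # 10 text positions
image = torch.randn(2, 49, 2048)   # 49 image regions
fused, attn_weights = CrossModalAttention()(text, image)
print(fused.shape, attn_weights.shape)  # (2, 10, 512) and (2, 10, 49)
```

In this reading, the fused representation would feed the decoder that generates the textual response, while the attention weights indicate which image regions each text position relied on.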