Recent Advances on Multi-modal Dialogue Systems: A Survey

Fenghua Cheng, Xue Li, Haoyang Wu, Jiangcheng Sang, Wenqi Zhao

Published: 2024, Last Modified: 10 Apr 2025. Venue: ADMA (5) 2024. License: CC BY-SA 4.0

Abstract: Empowering conversational agents to see the world and interact with humans using all their senses is one of the long-term goals of Artificial Intelligence (AI). Multi-modal interactions are crucial in real-world conversations, and the way different modalities complement one another helps improve the quality of a conversation. To this end, growing research interest has been devoted to developing multi-modal conversational agents with visual ability. Unlike traditional unimodal dialogue systems, a multi-modal dialogue system can read context from multiple modalities and respond based on its understanding of them. In this work, we provide a comprehensive review of recent advances in multi-modal dialogue generation. First, we categorize multi-modal dialogue systems according to the tasks they aim to address. Then, we review benchmark datasets and evaluation metrics. Finally, we discuss existing challenges and promising directions for future work.