Abstract: The integration of multimodal information, particularly visual content, into dialogue systems has focused primarily on interpreting user-provided inputs, while comparatively little attention has been paid to proactively using such content to enrich system responses. In this paper, we explore a new research direction that addresses this gap by enabling dialogue systems to autonomously determine when and how to supplement textual responses with relevant images, based on the conversational context and user intent. To support this goal, we propose AMI (Automated Multimodal Insertion), a novel framework for dynamic, context-aware multimodal supplementation in dialogue. We also introduce RID (Response with Appropriate Image Dataset), a bilingual (Chinese-English) multimodal, multi-turn dialogue dataset designed to train and evaluate systems on this capability. RID provides fine-grained annotations of image-insertion timing and rationale, together with carefully aligned image-text pairs that ensure semantic coherence. Our experiments demonstrate that models trained on RID not only generate more informative and engaging responses but also show a stronger ability to leverage visual content when it is truly beneficial. These findings highlight the potential of proactive multimodal supplementation and offer new insights for developing intelligent, human-like dialogue systems. Code and data are available at: https://github.com/Tanthen/SCIR-AMI.
DOI: 10.1145/3746027.3758178
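The abstract does not specify RID's data schema or AMI's decision interface; the sketch below is only an illustrative guess at what an annotated instance and the "when to insert an image" decision might look like. All names here (RIDExample, dialogue_context, insert_image, image_id, rationale, should_insert_image) and the keyword heuristic are assumptions for exposition, not the authors' actual format or method, which would use a learned model rather than hand-written rules.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RIDExample:
    """Hypothetical RID-style instance: a dialogue context plus an
    annotated decision about whether an image should accompany the reply."""
    dialogue_context: List[str]      # prior turns, oldest first
    response_text: str               # textual reply
    insert_image: bool               # annotation: attach an image here?
    image_id: Optional[str] = None   # aligned image, if insert_image is True
    rationale: Optional[str] = None  # annotated reason for the insertion timing


def should_insert_image(context: List[str]) -> bool:
    """Toy stand-in for the 'when' decision: a keyword trigger on the last
    user turn. A trained AMI-style model would replace this with a learned
    classifier over the conversational context and user intent."""
    visual_cues = ("look like", "show me", "picture", "what does", "appearance")
    last_turn = context[-1].lower() if context else ""
    return any(cue in last_turn for cue in visual_cues)


if __name__ == "__main__":
    example = RIDExample(
        dialogue_context=["What does the Shanghai Tower look like at night?"],
        response_text="It is lit by a spiral band of LEDs along its twisting facade.",
        insert_image=True,
        image_id="shanghai_tower_night_001",
        rationale="User explicitly asks about visual appearance.",
    )
    print(should_insert_image(example.dialogue_context))  # True
```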