TIGER: A Unified Generative Model Framework for Multimodal Dialogue Response Generation

Fanheng Kong, Peidong Wang, Shi Feng, Daling Wang, Yifei Zhang

Published: 2024, Last Modified: 18 Jul 2025, LREC/COLING 2024, CC BY-SA 4.0

Abstract: Responding with multimodal content has been recognized as one of the essential functionalities of intelligent conversational agents. However, existing research on multimodal dialogues primarily focuses on two topics: (1) textual response generation that grounds the conversation on a given image; and (2) visual response selection based on the dialogue context. In light of this gap, we propose mulTImodal GEnerator for dialogue Response (TIGER), a unified generative model framework for multimodal dialogue response generation. In extensive experiments, TIGER achieves new state-of-the-art results, providing users with an enhanced conversational experience. A multimodal dialogue system based on TIGER is available at https://github.com/friedrichor/TIGER. A video demonstrating the system is available at https://www.youtube.com/watch?v=Kd0CMwDs8Rk.