AutoGraph: Enabling Visual Context via Graph Alignment in Open Domain Multi-Modal Dialogue Generation

Published: 20 Jul 2024, Last Modified: 01 Aug 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Open-domain multi-modal dialogue systems rely heavily on visual information to generate contextually relevant responses. Existing open-domain multi-modal dialogue generation methods ignore the complementary relationship between modalities and are difficult to integrate with LLMs. To address these issues, we propose AutoGraph, a method that automatically constructs a visual context graph. We aim to structure complex information and integrate it seamlessly with large language models (LLMs), aligning information from multiple modalities at both the semantic and structural levels. Specifically, we fully connect the text graphs and scene graphs, and then trim unnecessary edges via LLMs to automatically construct a visual context graph. Next, we design, for the first time, several graph sampling grammars to convert graph structures into sequences suitable for LLMs. Finally, we propose a two-stage fine-tuning method that allows LLMs to understand the graph sampling grammars and generate responses. AutoGraph is a general approach that can enhance the visual capabilities of LLMs. We validate the proposed method on both text-based and visual-based LLMs. Experimental results show that our method achieves state-of-the-art performance on multiple public datasets.
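The pipeline described in the abstract (fully connecting text and scene graphs, trimming edges, then linearizing the result for an LLM) can be illustrated with a toy example. The snippet below is a minimal sketch under assumed data structures, not the authors' implementation: the `edge_is_relevant` stub stands in for the LLM-based edge trimming, and the depth-first serialization is only one plausible instance of a graph sampling grammar.

```python
from itertools import product

# Toy text graph and scene graph: each is a dict of node -> set of neighbours.
text_graph = {"dog": {"park"}, "park": {"dog"}}
scene_graph = {"dog_region": {"grass_region"}, "grass_region": {"dog_region"}}

def edge_is_relevant(u: str, v: str) -> bool:
    """Hypothetical stand-in for the LLM judgement that trims cross-modal edges.
    Here an edge is kept only if the two node names share a token."""
    return bool(set(u.split("_")) & set(v.split("_")))

def build_visual_context_graph(tg, sg):
    """Fully connect text nodes to scene nodes, then drop irrelevant edges."""
    graph = {n: set(nbrs) for n, nbrs in {**tg, **sg}.items()}
    for t_node, s_node in product(tg, sg):
        if edge_is_relevant(t_node, s_node):
            graph[t_node].add(s_node)
            graph[s_node].add(t_node)
    return graph

def serialize_dfs(graph, start):
    """One plausible 'graph sampling grammar': a DFS walk emitting (u -> v) tokens."""
    seen, tokens, stack = set(), [], [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for nbr in sorted(graph[node]):
            tokens.append(f"({node} -> {nbr})")
            if nbr not in seen:
                stack.append(nbr)
    return " ".join(tokens)

vcg = build_visual_context_graph(text_graph, scene_graph)
# The resulting sequence would be fed to the LLM alongside the dialogue context.
print(serialize_dfs(vcg, "dog"))
```

In this sketch the serialized edge tokens play the role of the structure-aware visual context; the actual grammars, the LLM-based trimming prompt, and the two-stage fine-tuning are described in the paper itself.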
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Vision and Language, [Content] Multimodal Fusion
Relevance To Conference: This paper focuses on open-domain multi-modal dialogue generation and proposes a method for automatically constructing visual context graphs to align multiple modalities at both the semantic and structural levels. For the first time, a novel graph sampling grammar is introduced to endow large language models with stronger visual capabilities. We believe that the multimodal alignment and fusion of multiple modalities with large language models proposed in this paper is in line with the theme of the conference. Furthermore, through experiments on several different large language models, we demonstrate that our method effectively enhances the generation capabilities of large language models in open-domain multi-modal dialogue.
Supplementary Material: zip
Submission Number: 2193