A Picture Is Worth a Graph: A Blueprint Debate Paradigm for Multimodal Reasoning

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, MM2024 Oral, CC BY 4.0
Abstract: This paper presents a pilot study that introduces multi-agent debate into multimodal reasoning. The study addresses two key challenges: the trivialization of opinions resulting from excessive summarization, and the diversion of focus caused by distractor concepts introduced from images. Both challenges stem from the inductive (bottom-up) nature of existing debating schemes. To address them, we propose a deductive (top-down) debating approach called Blueprint Debate on Graphs (BDoG). In BDoG, debates are confined to a blueprint graph to prevent opinion trivialization through world-level summarization. Moreover, by storing evidence in branches within the graph, BDoG mitigates distractions caused by frequent but irrelevant concepts. Extensive experiments validate that BDoG achieves state-of-the-art results on ScienceQA and MMBench, with significant improvements over previous methods. The source code can be accessed at https://github.com/open_upon_acceptance.
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: This paper presents a pilot study that introduces multi-agent debate into multimodal reasoning. We tackle two prominent challenges in this setting: opinion trivialization and focus diversion. Recognizing the limitations of existing debating schemes, we propose Blueprint Debate on Graphs (BDoG), which confines debates to a blueprint graph and stores evidence in graph branches to address word-level opinion trivialization and the distraction caused by irrelevant concepts. Extensive experiments on ScienceQA and MMBench validate the efficacy of BDoG, which surpasses previous methods and establishes new state-of-the-art results. This work opens up new possibilities for multimodal reasoning and has implications for multimedia applications including multimedia retrieval, question answering, and decision-making.
Supplementary Material: zip
Submission Number: 2569