An Entailment Tree Generation Approach for Multimodal Multi-Hop Question Answering with Mixture-of-Experts and Iterative Feedback Mechanism
Abstract: With the rise of large language models (LLMs), converting multimodal information into text descriptions has become a popular and effective strategy for multimodal multi-hop question answering. However, we argue that current multimodal multi-hop question answering methods still face two main challenges: 1) The retrieved evidence contains a large amount of redundant information, which inevitably causes a significant drop in performance because irrelevant information misleads the prediction. 2) The reasoning process lacks interpretable reasoning steps, making it difficult for the model to discover logical errors when handling complex questions. To solve these problems, we propose a unified LLM-based approach that does not rely heavily on LLMs, given their potential errors, and innovatively treat multimodal multi-hop question answering as a joint entailment tree generation and question answering problem. Specifically, we design a multi-task learning framework that facilitates common knowledge sharing between the interpretability and prediction tasks while preventing task-specific errors from interfering with each other via a mixture of experts. We then design an iterative feedback mechanism that further enhances both tasks by feeding the results of joint training back to the LLM to regenerate entailment trees, iteratively refining the potential answer. Notably, our method has won first place on the official WebQA leaderboard (since April 10, 2024) and achieves competitive results on MultimodalQA.
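To make the multi-task design concrete, below is a minimal, illustrative sketch (not the authors' code) of a task-aware mixture-of-experts layer: a shared pool of experts is routed by per-task gates, so the entailment-tree (interpretability) task and the answer-prediction task can share common knowledge while keeping task-specific computation separate. All class and variable names here are assumptions for illustration.

```python
# Illustrative sketch only, assuming a PyTorch setup; not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskMoELayer(nn.Module):
    def __init__(self, hidden_dim: int, num_experts: int = 4, num_tasks: int = 2):
        super().__init__()
        # Small feed-forward experts shared across both tasks.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
                          nn.Linear(hidden_dim, hidden_dim))
            for _ in range(num_experts)
        )
        # One gating network per task decides how much each expert contributes.
        self.gates = nn.ModuleList(nn.Linear(hidden_dim, num_experts) for _ in range(num_tasks))

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        # x: (batch, hidden_dim) shared encoder representation.
        gate = F.softmax(self.gates[task_id](x), dim=-1)            # (batch, num_experts)
        expert_out = torch.stack([e(x) for e in self.experts], 1)   # (batch, num_experts, hidden_dim)
        return (gate.unsqueeze(-1) * expert_out).sum(dim=1)         # task-weighted mixture

# Usage sketch: task_id 0 -> entailment-tree generation head, task_id 1 -> answer prediction head.
moe = TaskMoELayer(hidden_dim=768)
shared = torch.randn(8, 768)
tree_features = moe(shared, task_id=0)
answer_features = moe(shared, task_id=1)
```

Routing the same shared representation through per-task gates is one common way to let tasks share experts without their gradients overwriting each other's task-specific behavior; the paper's exact gating and expert configuration may differ.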
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Media Interpretation, [Generation] Multimedia Foundation Models, [Content] Vision and Language
Relevance To Conference: This work contributes to multimedia/multimodal processing by:
1. Multimodal Information Processing: Our method effectively transforms multimodal information into text for multi-hop question answering, demonstrating its ability to handle and integrate multimodal data.
2. Redundancy Elimination: By constructing a fact base and an entailment tree, our method eliminates redundant information, improving answer accuracy and multimodal processing efficiency.
3. Interpretable Reasoning: Our method generates explicit reasoning steps, which are crucial for understanding multimodal processing results. Through iterative enhancement, the entailment tree is continually corrected, improving system robustness (see the sketch after this list).
4. Outstanding Results: Our method ranked first on the WebQA leaderboard and achieved competitive results on MultimodalQA, demonstrating its effectiveness in multimodal question answering tasks.
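The iterative enhancement mentioned in point 3 can be pictured as the loop sketched below, assuming hypothetical helpers `call_llm`, `build_prompt`, and `joint_model` standing in for the actual LLM interface and joint model; this is an illustrative sketch, not the paper's implementation.

```python
# Illustrative sketch of an iterative feedback loop: the joint model's prediction and the
# current entailment tree are fed back to the LLM to regenerate a refined tree, until the
# answer stabilises or a round budget is exhausted. All names here are hypothetical.
def iterative_refine(question, facts, call_llm, build_prompt, joint_model, max_rounds=3):
    tree = call_llm(build_prompt(question, facts))   # initial entailment tree from the LLM
    answer = None
    for _ in range(max_rounds):
        new_answer, new_tree = joint_model(question, facts, tree)  # joint QA + tree prediction
        if new_answer == answer:                     # answer has stabilised; stop refining
            return new_answer, new_tree
        answer, tree = new_answer, new_tree
        # Feed the latest prediction back so the LLM can regenerate a better tree.
        tree = call_llm(build_prompt(question, facts, prior_tree=tree, prior_answer=answer))
    return answer, tree
```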
Submission Number: 4202