FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering

ACL ARR 2025 May Submission5388 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · License: CC BY 4.0
Abstract: Multimodal multihop question answering (MMQA) requires reasoning over images and text from multiple sources, a capability essential for many real-world applications. Despite advances in visual question answering, the multihop setting remains underexplored due to a lack of high-quality datasets: existing resources focus on single-hop questions, single modalities, or short texts, which limits real-world applications such as interpreting educational documents with long, multimodal content. To fill this gap, we introduce FM2DS, the first framework for creating a high-quality dataset for MMQA. Our approach is a five-stage pipeline that acquires relevant multimodal documents from Wikipedia, synthetically generates high-level questions and answers, and validates them against rigorous criteria to ensure data quality. We evaluate our methodology by training models on our synthesized dataset and testing them on two benchmarks, MultimodalQA and WebQA. Our results show that, at an equal sample size, models trained on our synthesized data outperform those trained on human-collected data by 1.9 exact match (EM) points on average. Additionally, we introduce M2QA-Bench, a 1k-sample benchmark and the first for MMQA over long documents, generated with FM2DS and refined by human annotators.
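The abstract describes the pipeline only at a high level, so as a rough illustration of the retrieve-generate-validate loop it sketches, here is a minimal Python skeleton. Every function name, data field, and the validation heuristic below are assumptions made for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of a multi-stage MMQA data-synthesis loop in the spirit
# of the abstract. All names and heuristics here are illustrative assumptions.
from dataclasses import dataclass, field


@dataclass
class Document:
    title: str
    text: str
    image_captions: list[str] = field(default_factory=list)


@dataclass
class QAPair:
    question: str
    answer: str
    sources: list[str]  # titles of the documents the question hops across


def retrieve_related_documents(seed_title: str) -> list[Document]:
    """Stage 1 (assumed): gather a seed Wikipedia page plus a linked page,
    keeping both text and image captions so questions can span modalities."""
    return [
        Document(seed_title, "..."),
        Document(seed_title + " (linked page)", "..."),
    ]


def generate_qa(docs: list[Document]) -> QAPair:
    """Stages 2-3 (assumed): prompt an LLM/VLM few-shot to write a question
    whose answer requires evidence from more than one document."""
    return QAPair(
        question="Placeholder multihop question over the retrieved documents",
        answer="Placeholder answer",
        sources=[d.title for d in docs],
    )


def is_valid(qa: QAPair, docs: list[Document]) -> bool:
    """Stages 4-5 (assumed): keep only pairs passing multihop checks, e.g.
    the question must draw on at least two distinct source documents."""
    return len(set(qa.sources)) >= 2


def synthesize(seed_titles: list[str]) -> list[QAPair]:
    """Run the full loop: retrieve, generate, then filter by validation."""
    dataset = []
    for seed in seed_titles:
        docs = retrieve_related_documents(seed)
        qa = generate_qa(docs)
        if is_valid(qa, docs):
            dataset.append(qa)
    return dataset


if __name__ == "__main__":
    print(synthesize(["Eiffel Tower"]))
```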
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: Few-shot data synthesis, Question generation, Dataset augmentation, Knowledge distillation, Vision-language models (VLMs / LVLMs), Multimodal Reasoning, Multihop Reasoning, Multimodal question answering (MMQA)
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 5388