Cross-Lingual Multimodal Retrieval-Augmented Generation for Open Question Answering in Tamil and Yoruba

ICLR 2026 Conference Submission 22573 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: multimodal question answering, cross-lingual retrieval, low-resource languages, knowledge base question answering, retrieval-augmented generation, benchmark dataset, Tamil, Yoruba, multilingual evaluation, visual reasoning, cross-modal fusion, machine translation, language equity, computational linguistics, information retrieval
Abstract: As large language models (LLMs) with retrieval-augmented generation (RAG) gain traction in multimodal knowledge-base question answering (KBQA), concerns about how well they transfer to low-resource languages (LRLs) remain unaddressed. We introduce LR-MMQA, a benchmark that assesses cross-lingual multimodal retrieval and reasoning under the challenges of LRLs. Using a state-of-the-art LLM, we translated the hardest questions from WebQA and MultimodalQA into Tamil and Yoruba, creating a dataset that stresses cross-evidence aggregation and multi-hop inference. We also introduce XM-RAG, a cross-lingual multimodal RAG pipeline optimized for LRLs, which achieves 38.1 overall answer accuracy, more than 6.3 points higher than the next-best baseline. Our findings expose significant biases and discrepancies in existing systems, with LR-MMQA highlighting specific failure points. Notably, XM-RAG's performance on LR-MMQA falls far below that of top models on the English datasets (WebQA: 64.4, MultimodalQA: 73.48 answer accuracy), demonstrating that current methods still fail at complex, real-world tasks in LRLs. By releasing LR-MMQA and XM-RAG, we provide resources to evaluate and address these gaps and to guide progress toward equitable multimodal KBQA.
Primary Area: datasets and benchmarks
Submission Number: 22573
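
For readers unfamiliar with the pipeline pattern the abstract describes (translate the low-resource query, retrieve mixed text/image evidence, generate an answer), below is a minimal, purely illustrative Python sketch of a generic cross-lingual multimodal RAG flow. Every function, data structure, and value here is a hypothetical stand-in for exposition; the abstract does not specify XM-RAG at this level of detail, and this is not the authors' implementation.

```python
# Illustrative sketch of a generic cross-lingual multimodal RAG flow.
# NOTE: all components below are hypothetical placeholders, not the
# paper's XM-RAG pipeline, whose internals are not given in the abstract.
from dataclasses import dataclass


@dataclass
class Evidence:
    modality: str   # "text" or "image"
    content: str    # passage text, or an image caption / identifier
    score: float    # retrieval relevance score


def translate_to_pivot(question: str, src_lang: str) -> str:
    """Hypothetical LRL -> English (pivot) translation step, e.g. via an LLM or MT model."""
    return question  # placeholder: a real system would call a translation model here


def retrieve_multimodal(query: str, k: int = 5) -> list[Evidence]:
    """Hypothetical cross-modal retriever over a text + image knowledge base."""
    # Placeholder corpus; a real retriever would rank dense multimodal embeddings.
    corpus = [
        Evidence("text", "Example passage related to the query topic.", 0.92),
        Evidence("image", "caption: example photo relevant to the query", 0.88),
    ]
    return sorted(corpus, key=lambda e: e.score, reverse=True)[:k]


def generate_answer(question: str, evidence: list[Evidence], tgt_lang: str) -> str:
    """Hypothetical generation step: fuse retrieved evidence into a prompt and
    answer in the original low-resource language."""
    context = "\n".join(f"[{e.modality}] {e.content}" for e in evidence)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer in {tgt_lang}:"
    return prompt  # placeholder: a real system would pass this prompt to an LLM


if __name__ == "__main__":
    question_ta = "Example Tamil question (placeholder)"
    pivot_query = translate_to_pivot(question_ta, src_lang="ta")
    retrieved = retrieve_multimodal(pivot_query)
    print(generate_answer(pivot_query, retrieved, tgt_lang="ta"))
```

The translate-then-retrieve structure shown here is one common strategy for cross-lingual RAG; whether XM-RAG uses a pivot language, joint multilingual embeddings, or another fusion scheme is not stated in the abstract.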