Cross-Lingual Multimodal Retrieval-Augmented Generation for Open Question Answering in Tamil and Yoruba

Published: 23 Sept 2025, Last Modified: 23 Dec 2025, SPIGM @ NeurIPS, CC BY 4.0
Keywords: Low-Resource Languages, Multimodal Representation Learning, Retrieval-Augmented Generation, Knowledge Base Question Answering, Cross-Lingual Transfer, Multilingual Embedding, Multimodal Benchmarking, Dataset Construction, Bias and Failure Analysis, Data Scarcity, Structured Data Applications, Empirical Evaluation
TL;DR: We introduce LR-MMQA, the first multimodal, cross-lingual KBQA benchmark for low-resource languages that reveals current model limitations, alongside XM-RAG, a novel RAG pipeline demonstrating effective zero-shot transfer and bias mitigation.
Abstract: As large language models (LLMs) with retrieval-augmented generation (RAG) gain traction in multimodal knowledge base question answering (KBQA), concerns about their transfer to low-resource languages (LRLs) remain unaddressed. We introduce LR-MMQA, a benchmark evaluating multimodal cross-lingual retrieval and reasoning in LRLs. Starting from the hardest examples in WebQA and MultimodalQA, we build a high-quality LRL benchmark through LLM-assisted translation, human validation, and culturally aligned rewriting that reflects native-speaker phrasing (i.e., what a native speaker would naturally ask) while preserving answerability. We also present XM-RAG, a cross-lingual multimodal RAG pipeline for LRLs that reaches 38.1 answer accuracy, more than 7.8 points above the next-best baseline. LR-MMQA exposes major performance gaps and failure modes in current systems: all baselines fall far below top English-language results (64.4 on WebQA and 73.48 on MultimodalQA), showing that existing methods still struggle with complex tasks in LRL settings. By releasing LR-MMQA and XM-RAG, we offer a resource for evaluating and addressing these gaps and for guiding progress toward equitable multimodal KBQA.
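The abstract does not detail the retrieval step, so the following is a minimal, hypothetical sketch of cross-lingual retrieval over a multilingual embedding space in the spirit of XM-RAG, not the authors' implementation. It assumes the multilingual CLIP text encoder distributed with the sentence-transformers library; the model name, the retrieve helper, and the in-memory corpus are illustrative assumptions.

    # Minimal sketch (hypothetical, not the paper's code): embed a Tamil or Yoruba
    # question with a multilingual encoder and retrieve the closest knowledge snippets.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Multilingual text encoder aligned to a CLIP image space, so captions or image
    # embeddings could in principle be searched in the same space.
    encoder = SentenceTransformer("sentence-transformers/clip-ViT-B-32-multilingual-v1")

    def retrieve(question: str, corpus: list[str], k: int = 3) -> list[str]:
        """Return the k corpus snippets most similar to the question (any language)."""
        q = encoder.encode([question], normalize_embeddings=True)
        d = encoder.encode(corpus, normalize_embeddings=True)
        scores = (q @ d.T)[0]  # cosine similarity via normalized dot product
        return [corpus[i] for i in np.argsort(-scores)[:k]]

    # The retrieved snippets would then be packed into an LLM prompt that asks the
    # model to answer in the question's language (zero-shot cross-lingual transfer).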
Submission Number: 77