Cross-Lingual Multimodal Retrieval-Augmented Generation for Open Question Answering in Tamil and Yoruba
Keywords: Low-Resource Languages, Multimodal Learning, Retrieval-Augmented Generation, Question Answering, Knowledge Base Question Answering, Cross-Lingual Transfer, Multilingual Representation Learning, Benchmarking, Dataset, Bias Analysis, Data Scarcity, Failure Analysis
TL;DR: We introduce LR-MMQA, the first multimodal, cross-lingual KBQA benchmark for low-resource languages that reveals current model limitations, alongside XM-RAG, a novel RAG pipeline demonstrating effective zero-shot transfer and bias mitigation.
Abstract: As large language models (LLMs) with retrieval-augmented generation (RAG) gain traction in multimodal knowledge-base question answering (KBQA), concerns about how well they transfer to low-resource languages (LRLs) remain unaddressed. We introduce LR-MMQA, a benchmark that assesses multimodal cross-lingual retrieval and reasoning under the challenges of LRLs. Using a state-of-the-art LLM, we translated the hardest questions from WebQA and MultimodalQA, creating a dataset that stresses cross-evidence aggregation and multi-hop inference. We also introduce XM-RAG, a cross-lingual multimodal RAG pipeline optimized for LRLs, which achieves 38.1 answer accuracy overall, more than 6.3 points above the next best baseline. Our findings expose significant biases and discrepancies in existing systems, with LR-MMQA highlighting specific failure points. Notably, XM-RAG's performance on LR-MMQA remains far below that of top models on the English datasets (WebQA: 64.4, MultimodalQA: 73.48 answer accuracy), demonstrating that current methods still fail at complex, real-world tasks in LRLs. By releasing LR-MMQA and XM-RAG, we provide resources to evaluate and address these gaps and to guide progress toward equitable multimodal KBQA.
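To make the setup concrete, below is a minimal sketch of the retrieval step in a cross-lingual multimodal RAG loop of the kind the abstract describes: a query in Tamil or Yoruba is embedded into a shared multilingual space and matched against English text and image evidence before answer generation. The function names (`retrieve`, `answer`, the `generate` callback) and the cosine-similarity scoring are illustrative assumptions, not the actual XM-RAG implementation.

```python
# Hypothetical sketch of cross-lingual multimodal retrieval for RAG.
# Assumes query and evidence embeddings already live in one shared
# multilingual space; embedding models and the LLM call are left abstract.
from dataclasses import dataclass
import math


@dataclass
class Evidence:
    modality: str          # "text" or "image"
    content: str           # passage text, or an image caption / identifier
    vector: list[float]    # embedding in the shared multilingual space


def cosine(a: list[float], b: list[float]) -> float:
    # Standard cosine similarity; returns 0.0 for degenerate vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_vec: list[float], knowledge_base: list[Evidence], k: int = 4) -> list[Evidence]:
    # Rank English text and image evidence against a Tamil/Yoruba query vector.
    ranked = sorted(knowledge_base, key=lambda e: cosine(query_vec, e.vector), reverse=True)
    return ranked[:k]


def answer(query_vec: list[float], knowledge_base: list[Evidence], generate) -> str:
    # `generate` stands in for any LLM call that conditions on retrieved evidence.
    evidence = retrieve(query_vec, knowledge_base)
    prompt = "\n".join(f"[{e.modality}] {e.content}" for e in evidence)
    return generate(prompt)
```

In this sketch, cross-evidence aggregation happens simply by concatenating the top-k retrieved items into the generation prompt; the paper's pipeline presumably uses a more elaborate aggregation and reasoning stage.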
Submission Number: 91