Language Bias in Multilingual RAG: A Case Study in the Japanese Medical Domain

ACL ARR 2025 February Submission 6537 Authors

16 Feb 2025 (modified: 09 May 2025) · CC BY 4.0
Abstract: Despite the significant achievements of LLMs in recent years, their performance on low-resource language-domain pairs remains unsatisfactory. Although RAG is often considered a solution, we identify a paradox: an LLM's poor performance on a low-resource language-domain pair stems from a lack of corpora, yet RAG likewise depends on comprehensive, high-quality corpora. We show that this paradox can cause RAG to fail on certain low-resource language-domain pairs, such as the Japanese medical domain. We propose using high-resource corpora to enhance knowledge coverage. We also identify and address a language bias issue that arises when using multilingual corpora and prevents the RAG framework from fully exploiting them. With our proposed RAG framework and reranker training method, the RAG performance of LLMs improves by 4.36-7.96 percentage points on JMedBench.
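To make the abstract's core idea concrete, below is a minimal sketch of multilingual retrieval with joint reranking: candidates are pulled from both a low-resource (Japanese) corpus and a high-resource (English) corpus, then scored together by a multilingual cross-encoder so that ranking reflects relevance rather than query-language match. This is an illustration only; the model names (`intfloat/multilingual-e5-base`, `BAAI/bge-reranker-v2-m3`), the toy corpora, and the merging strategy are assumptions on my part, not the paper's actual framework or trained reranker.

```python
# Hypothetical sketch of language-bias-aware multilingual RAG retrieval.
# Assumptions: off-the-shelf multilingual retriever and reranker stand in
# for the paper's proposed components.
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np

retriever = SentenceTransformer("intfloat/multilingual-e5-base")  # assumed model
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")                # assumed model

# Toy stand-ins for a low-resource Japanese medical corpus and a
# high-resource English medical corpus.
ja_corpus = [
    "インスリンは血糖値を下げるホルモンである。",
    "アスピリンは解熱鎮痛薬として用いられる。",
]
en_corpus = [
    "Insulin is a hormone that lowers blood glucose levels.",
    "Aspirin is used as an antipyretic and analgesic drug.",
]

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Dense retrieval by cosine similarity within a single corpus.
    q = retriever.encode([query], normalize_embeddings=True)
    d = retriever.encode(corpus, normalize_embeddings=True)
    scores = (q @ d.T)[0]
    top = np.argsort(-scores)[:k]
    return [corpus[i] for i in top]

def multilingual_retrieve(query: str, k: int = 2) -> list[str]:
    # Pool candidates from both corpora, then rerank them jointly.
    # A cross-encoder scores each (query, document) pair on one scale,
    # so an English passage can outrank a weakly relevant Japanese one
    # even for a Japanese query, counteracting retriever language bias.
    candidates = retrieve(query, ja_corpus, k) + retrieve(query, en_corpus, k)
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda x: -x[0])
    return [doc for _, doc in ranked[:k]]

print(multilingual_retrieve("インスリンの作用は何ですか？"))
```

The key design choice in this sketch is reranking the pooled candidate set rather than merging per-language ranked lists, since per-corpus retrieval scores are typically not comparable across languages.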
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: NLP Applications, Information Retrieval and Text Mining
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: Japanese, English
Submission Number: 6537