DiverseRAG: Multi-Source Retrieval Augmented Generation for Multilingual and Multidialectal Question Answering with LLMs

ACL ARR 2024 December Submission 377 Authors

13 Dec 2024 (modified: 05 Feb 2025) · ACL ARR 2024 December Submission · CC BY 4.0
Abstract: The field of question answering (QA) has been significantly transformed by the emergence of Large Language Models (LLMs). However, their performance in domain-specific QA, such as in e-government applications, is limited by their lack of access to external, real-time, and highly specific knowledge. To address this, we introduce DiverseRAG, a novel framework that combines Retrieval-Augmented Generation (RAG) with LLMs, emphasizing a multi-source and multi-grained retrieval process to enhance response accuracy and relevance. Our approach employs a multi-source RAG strategy, drawing from diverse data types such as web pages and legal texts, and a multi-grained retrieval process that operates at the sentence and multi-sentence levels to ensure both precision and contextual depth in answering questions. This approach ensures comprehensive coverage of government-related questions. To test DiverseRAG, we curated an English-Arabic dataset from UAE government websites and further extended the questions into four Arabic dialects: Egyptian, Iraqi, Lebanese, and Emirati. Our results demonstrate that DiverseRAG substantially boosts the performance of LLMs on English, MSA, and dialectal Arabic queries in the government domain, achieving over a 10% improvement in metrics such as F1 score, BERTScore, ROUGE, and Context Precision compared to a conventional RAG approach in the best case.
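The multi-source, multi-grained retrieval idea described in the abstract can be illustrated with a minimal sketch. Note this is an assumption-laden toy illustration, not the paper's implementation: it uses simple lexical overlap in place of whatever retriever the authors use, and the chunking window size and scoring are invented for the example.

```python
# Illustrative sketch of multi-source, multi-grained retrieval.
# NOT the paper's actual pipeline: the retriever (lexical overlap),
# window size, and ranking below are hypothetical stand-ins.
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def overlap_score(query, passage):
    # Simple lexical-overlap score, standing in for a real dense/sparse retriever.
    q, p = Counter(tokenize(query)), Counter(tokenize(passage))
    return sum((q & p).values())

def multi_grained_chunks(document, window=3):
    # Fine granularity: single sentences (precision).
    # Coarse granularity: windows of consecutive sentences (contextual depth).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    coarse = [" ".join(sentences[i:i + window]) for i in range(0, len(sentences), window)]
    return sentences + coarse

def retrieve(query, sources, top_k=2):
    # Multi-source: pool chunks from every source (e.g., web pages, legal
    # texts), then rank the combined pool and return the top-k passages.
    pool = [c for doc in sources for c in multi_grained_chunks(doc)]
    return sorted(pool, key=lambda c: overlap_score(query, c), reverse=True)[:top_k]
```

In a full RAG system, the passages returned by `retrieve` would be placed in the LLM prompt as grounding context before generating the answer.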
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: RAG, Question Answering, Arabic Dialects
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Data resources
Languages Studied: English, MSA, Egyptian, Iraqi, Lebanese, and Emirati
Submission Number: 377