Improving Question-Answering Capabilities in Large Language Models Using Retrieval Augmented Generation (RAG): A Case Study on Yoruba Culture and Language

Published: 03 Mar 2024, Last Modified: 11 Apr 2024 · AfricaNLP 2024 · CC BY 4.0
Keywords: Large Language Models, Retrieval Augmented Generation
Abstract: This study addresses the phenomenon of hallucination in large language models (LLMs), particularly in GPT-3.5 Turbo, when tasked with processing queries in Yoruba, a low-resource language. Hallucination refers to the generation of incorrect information, often caused by the model's unfamiliarity with content or languages not extensively covered during pretraining. We propose a methodology that incorporates Retrieval-Augmented Generation (RAG) techniques to mitigate this issue. Our method uses an exclusive dataset derived from a Yoruba-centric blog, covering subjects ranging from language-learning resources to folklore. By embedding this data into an open-source Chroma vector database, we improve GPT-3.5 Turbo's ability to deliver responses that are not only linguistically and factually correct but also reflect the cultural nuances of the Yoruba heritage. This enhancement marks a significant step towards a chatbot aimed at promoting and disseminating knowledge about the Yoruba culture and language.
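The retrieve-then-generate pattern the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: the real pipeline embeds a Yoruba-centric blog corpus into a Chroma vector store and prepends the retrieved passages to the GPT-3.5 Turbo prompt, whereas here a toy bag-of-words cosine similarity stands in for the embedding model, and the example corpus and prompt template are hypothetical.

```python
# Sketch of the RAG pattern: retrieve relevant passages, then ground the
# LLM prompt in them. A bag-of-words cosine similarity is a toy stand-in
# for the real embedding model and Chroma vector store.
from collections import Counter
from math import sqrt


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence-embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Return the k corpus passages most similar to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, corpus: list[str]) -> str:
    # Retrieved passages ground the model's answer, which is how RAG
    # reduces hallucination on content absent from pretraining.
    context = "\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Hypothetical two-document corpus for illustration only.
corpus = [
    "Yoruba is a tonal language with three level tones.",
    "Jollof rice recipes vary across West Africa.",
]
prompt = build_prompt("How many tones does Yoruba have?", corpus)
```

In the actual system, `embed` would be an embedding model, `retrieve` a Chroma collection query, and the prompt would be sent to GPT-3.5 Turbo for generation.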
Submission Number: 50