Keywords: Chatbot, Multilingual, Information Extraction, Machine Translation
TL;DR: A multilingual RAG system that improves document QA accuracy using hybrid retrieval and open-source models.
Abstract: Document-based question-answering (QA) systems give users an intuitive interface for extracting information from large text repositories without manual searching. These systems rely on Natural Language Processing (NLP) and Artificial Intelligence (AI) to understand document content and generate accurate responses to user queries. The Retrieval-Augmented Generation (RAG) architecture enhances these capabilities by combining semantic retrieval with generative models to produce contextually grounded answers. As global organizations increasingly work with multilingual document collections, there is a growing need for systems that can process and query documents across language barriers. While some document QA systems have been developed for specific languages, most operate only in English, and few handle multilingual document repositories effectively. Configured correctly, a multilingual document QA system has the potential to provide an information extraction solution that transcends language barriers. We developed a multilingual document processing system that lets users upload documents in various languages and interact with their content through natural language queries. The system combines a RAG architecture with multilingual language models to provide accurate, contextually relevant answers extracted from user-provided documents, regardless of the source language. It targets efficient multilingual document analysis in global academic, professional, and research contexts. By pairing cross-lingual NLP techniques with a user-friendly interface, the system broadens access to complex multilingual document analysis, letting users quickly extract insights from diverse document collections without manual translation or language-specific searching.
Submission Number: 14