A review of state-of-the-art methods for evaluating the quality of large language models in systems using RAG

Authors: MathAI 2025 Conference Submission 28 (anonymous)

01 Feb 2025 (modified: 22 Feb 2025) · MathAI 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: question-answer systems, large language models, LLM, RAG, quality evaluation
TL;DR: This paper reviews state-of-the-art methods for building question-answering systems based on large language models, focusing on RAG for Russian. Experiments evaluate several basic approaches to constructing RAG.
Abstract: This paper examines one of today's most relevant approaches to building intelligent assistants: question-answering systems based on large language models (LLMs) that use in-context learning or retrieval-augmented generation (RAG). Numerous recent publications on this topic are mostly English-oriented and rely on state-of-the-art models such as GPT-4 and its later versions; quality evaluations of RAG for Russian-language tasks, however, are practically absent. This study measures quality metrics with existing evaluation frameworks across three knowledge domains. We deliberately set aside combined RAG methods such as SelfRAG in order to evaluate several basic RAG construction approaches separately, including naive RAG, HyDE, and BM25; this makes it possible to assemble an efficient hybrid RAG. The obtained evaluations are qualitatively consistent with the results of other studies. During the experiments we observed high variability in the responses of large language models, which we overcame by averaging over a series of responses (see the sketch below). The results can serve as baseline evaluations for decision making when selecting a RAG architecture.
Submission Number: 28
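
A minimal sketch (not the authors' code) of the averaging step mentioned in the abstract: the quality metric is computed over several repeated generations and averaged to smooth out the sampling variability of the LLM. The functions generate_answer and score_answer are hypothetical placeholders standing in for a RAG pipeline (naive RAG, HyDE, or BM25-based) and a metric from an evaluation framework, respectively.

```python
# Sketch: mitigating LLM response variability by averaging a quality
# metric over a series of repeated generations for the same question.
from statistics import mean, stdev
from typing import Callable


def averaged_score(
    question: str,
    generate_answer: Callable[[str], str],    # hypothetical RAG pipeline (naive, HyDE, BM25, ...)
    score_answer: Callable[[str, str], float],  # hypothetical metric (question, answer) -> score
    n_runs: int = 5,
) -> tuple[float, float]:
    """Run the pipeline n_runs times and return the mean metric value
    and its standard deviation across runs."""
    scores = [score_answer(question, generate_answer(question)) for _ in range(n_runs)]
    spread = stdev(scores) if n_runs > 1 else 0.0  # stdev needs >= 2 samples
    return mean(scores), spread
```

Reporting the standard deviation alongside the mean makes it clear how much of an observed difference between RAG variants exceeds the run-to-run noise of the model.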