Abstract: Because administrative language tends to be formal and free of double meanings and figurative expressions, it is a particularly suitable domain in which to explore the performance of Language Models. This paper presents a study on the feasibility of building RAG systems over administrative texts to serve as chatbots, analyzing the performance of several Small and Large Language Models on this task and defining ways to evaluate whether they hallucinate and whether they provide useful information to the user. Conventional metrics that depend on ground-truth labels, such as cosine similarity or those of the ROUGE family, are explored, along with new approaches that apply metrics less common in text evaluation, such as Euclidean and Manhattan distances. Moreover, all of these objective metrics are compared against a subjective Likert scale to assess how well each proposed RAG system solves real users' problems and to find relations between subjective perceptions and objectively measured metrics. The results show that an SLM (such as NeuralChat) can perform as well as an LLM if the RAG pipeline provides it with an appropriate context.
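To illustrate the embedding-based metrics the abstract mentions (cosine similarity alongside the less common Euclidean and Manhattan distances), here is a minimal sketch of how such scores could be computed between a generated answer and a ground-truth reference. This is not the paper's actual evaluation code: the 384-dimensional vectors, the random placeholder embeddings, and the use of NumPy are all assumptions; in practice the vectors would come from a sentence-embedding model.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (higher = more similar)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """L2 (Euclidean) distance between embeddings (lower = more similar)."""
    return float(np.linalg.norm(a - b))


def manhattan_distance(a: np.ndarray, b: np.ndarray) -> float:
    """L1 (Manhattan) distance between embeddings (lower = more similar)."""
    return float(np.sum(np.abs(a - b)))


# Placeholder embeddings so the snippet runs standalone; in a real
# pipeline these would be sentence embeddings of the RAG answer and
# the ground-truth reference answer. Dimension 384 is a common size
# for small sentence encoders and is an assumption here.
rng = np.random.default_rng(0)
answer_emb = rng.normal(size=384)
reference_emb = rng.normal(size=384)

print(f"cosine:    {cosine_similarity(answer_emb, reference_emb):.4f}")
print(f"euclidean: {euclidean_distance(answer_emb, reference_emb):.4f}")
print(f"manhattan: {manhattan_distance(answer_emb, reference_emb):.4f}")
```

Note that cosine similarity increases with similarity while the two distance metrics decrease, so the scores must be aligned (e.g., by negation or normalization) before being correlated with Likert-scale judgments.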
External IDs: dblp:conf/ideal/SanchezNavalonMGF24