EncouRAGe: Evaluating RAG Local, Reliable and Efficient

Published: 01 May 2026, Last Modified: 01 May 2026
Venue: RAG4Report 2026 Poster
License: CC BY 4.0
Keywords: RAG, Evaluation, Library, Python, Methods, Benchmarking, Framework, LLM-as-a-judge
Abstract: We introduce $\textbf{EncouRAGe}$, a comprehensive Python library designed to streamline the development and evaluation of Retrieval-Augmented Generation (RAG) systems built on Large Language Models (LLMs) and embedding models. EncouRAGe comprises five modular and extensible components: $\textit{Type Manifest}$, $\textit{RAG Factory}$, $\textit{Inference}$, $\textit{Vector Store}$, and $\textit{Metrics}$, facilitating flexible experimentation and straightforward extension. Each component supports RAG evaluation with an emphasis on $\textbf{scientific reproducibility}$, $\textbf{diverse evaluation metrics}$, and $\textbf{local deployment}$, allowing researchers to efficiently assess datasets within RAG workflows. This paper presents implementation details and an extensive evaluation across four benchmark datasets comprising $\textit{25k QA pairs}$ and $\textit{over 51k documents}$. Our results show that RAG still underperforms compared to the $\textit{Oracle Context}$, while $\textit{Hybrid BM25}$ consistently achieves the best retrieval results across all four datasets. $\textbf{Code}$: https://github.com/uhh-hcds/encourage
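For readers unfamiliar with the hybrid retrieval the abstract refers to: a common scheme combines lexical BM25 ranking with a dense-embedding ranking and fuses the two, e.g. via reciprocal rank fusion. The following is a minimal self-contained sketch of that general technique, not EncouRAGe's actual API (its real interfaces are documented in the repository linked above); all names here are illustrative.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                       # document frequency per term
    for d in docs:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def rrf_fuse(rankings, k=60):
    """Fuse several per-retriever rankings of doc ids via reciprocal rank fusion."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return [doc_id for doc_id, _ in fused.most_common()]

docs = [
    "the capital of france is paris".split(),
    "berlin is the capital of germany".split(),
    "paris hosts the louvre museum".split(),
]
query = "capital of france".split()

lexical = bm25_scores(query, docs)
lexical_rank = sorted(range(len(docs)), key=lambda i: -lexical[i])
dense_rank = [0, 2, 1]  # a dense retriever would supply this ranking; stubbed here
print(rrf_fuse([lexical_rank, dense_rank]))  # fused doc-id ranking
```

Rank fusion rather than score interpolation is shown here because BM25 and cosine similarities live on incompatible scales; fusing ranks sidesteps the normalization question.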
Submission Number: 8