Pencils Down! Automatic Rubric-based Evaluation of Retrieve/Generate Systems

Published: 07 Jun 2024, Last Modified: 07 Jun 2024 · ICTIR 2024 · CC BY 4.0
Keywords: evaluation, large language models, flan-t5-large, grading rubric, nugget-based evaluation, question-based evaluation
TL;DR: The RUBRIC evaluation metric creates a grading rubric for each query, then uses an LLM to grade all IR system responses.
Abstract: Current IR evaluation paradigms are challenged by large language models (LLMs) and retrieval-augmented generation (RAG) methods. Furthermore, evaluation either resorts to expensive human judgments or leads to an over-reliance on LLMs. To remedy this situation, we introduce the RUBRIC metric, which puts information retrieval systems to the proverbial test. This metric leverages a bank of query-related test questions to quantify the relevant information content contained in the systems' responses. The process involves (1) decomposing the query into detailed questions, and (2) checking each question for answerability using passages in the system response. Using three TREC benchmarks, we demonstrate that our LLM-based RUBRIC approach works successfully. Unlike previous LLM-based evaluation measures, our paradigm lends itself to incorporating a human-in-the-loop without the danger of over-reliance on AI or resorting to expensive manual passage-level judgments. Moreover, our evaluation is repeatable and extensible and can be scored with existing evaluation tools.
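The two-step process described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes a flan-t5-large grader (per the keywords) accessed through the Hugging Face transformers pipeline, and the prompt wording, the `decompose_query`/`is_answerable`/`rubric_score` helper names, and the coverage-fraction scoring are illustrative assumptions only.

```python
# Hypothetical sketch of a RUBRIC-style evaluation loop (not the paper's code).
from transformers import pipeline

# Assumption: flan-t5-large is used both to generate and to grade rubric questions.
grader = pipeline("text2text-generation", model="google/flan-t5-large")

def decompose_query(query: str, n_questions: int = 10) -> list[str]:
    """Step 1 (assumed prompt): generate query-related test questions."""
    prompt = (f"Generate {n_questions} short questions whose answers "
              f"would be relevant to: {query}")
    out = grader(prompt, max_new_tokens=256)[0]["generated_text"]
    return [q.strip() for q in out.split("\n") if q.strip()]

def is_answerable(question: str, passage: str) -> bool:
    """Step 2 (assumed prompt): check whether the passage answers the question."""
    prompt = (f"Question: {question}\nPassage: {passage}\n"
              "Can the question be answered from the passage? Answer yes or no.")
    answer = grader(prompt, max_new_tokens=4)[0]["generated_text"].lower()
    return answer.startswith("yes")

def rubric_score(query: str, response_passages: list[str]) -> float:
    """Fraction of rubric questions answered by at least one response passage."""
    questions = decompose_query(query)
    covered = sum(
        any(is_answerable(q, p) for p in response_passages) for q in questions
    )
    return covered / max(len(questions), 1)
```

Because each system response is scored against the same per-query question bank, the resulting scores can be written out in standard run/qrels form and aggregated with existing evaluation tools, as the abstract notes.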
Submission Number: 11