Abstract: The reliable and repeatable evaluation of interactive, conversational, or generative IR systems is an ongoing research topic in the field of retrieval evaluation. One proposed solution is to fully automate evaluation through simulated user behavior and automated relevance judgments. Still, simulation frameworks have so far been technically quite complex and have not been widely adopted. Recently, however, easy access to large language models has drastically lowered the hurdles for both user behavior simulation and automated judgments. We therefore argue that it is high time to investigate how simulation-based evaluation setups should be evaluated themselves. In this position paper, we present GenIRSim, a flexible and easy-to-use simulation and evaluation framework for generative IR, and we explore GenIRSim’s parameter space to identify open research questions on evaluating simulation-based evaluation setups.
External IDs: dblp:conf/clef/KieselGMHS24
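To make the automated-evaluation idea from the abstract concrete, the following minimal Python sketch shows the generic loop underlying simulation-based evaluation: a simulated user issues utterances, the system under test responds, and an automated judge scores each response. All names here (`simulate_dialogue`, `Turn`, the callbacks) are hypothetical illustrations and do not reflect the actual GenIRSim API.

```python
from dataclasses import dataclass
from typing import Callable, List

# Conceptual sketch of a simulation-based evaluation loop (hypothetical names,
# not the GenIRSim API): simulated user -> system under test -> automated judge.

@dataclass
class Turn:
    utterance: str   # simulated user utterance
    response: str    # system's generated answer
    score: float     # automated relevance judgment in [0, 1]

def simulate_dialogue(
    user_turn: Callable[[List[Turn]], str],   # simulated user (e.g., LLM-backed)
    system_respond: Callable[[str], str],     # generative IR system under test
    judge: Callable[[str, str], float],       # automated per-turn judgment
    max_turns: int = 5,
) -> List[Turn]:
    """Run one simulated conversation and judge every system response."""
    history: List[Turn] = []
    for _ in range(max_turns):
        utterance = user_turn(history)        # next simulated user utterance
        response = system_respond(utterance)  # system's generated answer
        score = judge(utterance, response)    # automated relevance judgment
        history.append(Turn(utterance, response, score))
    return history

if __name__ == "__main__":
    # Toy stand-ins: a scripted user, an echoing system, and a keyword judge.
    script = iter(["what is dense retrieval?", "how is it evaluated?"])
    turns = simulate_dialogue(
        user_turn=lambda h: next(script, "thanks, that is all"),
        system_respond=lambda u: f"Here is an answer about: {u}",
        judge=lambda u, r: 1.0 if u.split()[0] in r.lower() else 0.0,
        max_turns=2,
    )
    print(sum(t.score for t in turns) / len(turns))  # mean per-turn score
```

The paper's central question is how to evaluate such a setup itself, i.e., how sensitive the resulting scores are to the choice of simulated user, judge, and loop parameters.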