Evaluation of Temporal Change in IR Test Collections

Published: 07 Jun 2024, Last Modified: 07 Jun 2024 · ICTIR 2024 · CC BY 4.0
Keywords: Longitudinal Evaluation, Continuous Evaluation, Reproducibility
TL;DR: This study investigates how the temporal changes in test collections affect the retrieval results.
Abstract: Information retrieval systems have been evaluated using the Cranfield paradigm for many years. This paradigm allows a systematic, fair, and reproducible evaluation of different retrieval methods in fixed experimental environments. However, real-world retrieval systems must cope with dynamic environments and temporal changes that affect the document collection, topical trends, and the individual user's perception of what is considered relevant. Yet, the temporal dimension of IR evaluations is still little studied. To this end, this work investigates how the temporal generalizability of effectiveness evaluations can be assessed. As a conceptual model, we generalize Cranfield-type experiments to the temporal context by classifying changes in the essential components according to the CRUD operations known from persistent storage. From the theoretically possible changes, different evaluation scenarios emerge, and we outline what they imply. Based on these scenarios, we test renowned state-of-the-art retrieval systems and investigate how retrieval effectiveness changes at different levels of granularity. We show that the proposed measures are well suited to describing the changes in the retrieval results. The experiments confirm that retrieval effectiveness strongly depends on the evaluation scenario investigated. We find that not only the average retrieval performance of individual systems but also the relative system performance is strongly affected by which components change and to what extent.
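The CRUD-based view of collection change described in the abstract can be made concrete with a small sketch. The following Python snippet is not the authors' code; it is a minimal illustration, under the assumption that two snapshots of a document collection are compared by document ID and text, of how document-level changes map onto Create, Read, Update, and Delete (with Read denoting documents that persist unchanged). All identifiers are hypothetical.

```python
# Minimal sketch (illustrative, not from the paper): classify document-level
# changes between two test-collection snapshots using CRUD operations.

def classify_crud(old: dict[str, str], new: dict[str, str]) -> dict[str, set[str]]:
    """Assign each document ID a CRUD class by comparing two snapshots.

    old/new map doc_id -> document text.
    - create: doc_id appears only in the new snapshot
    - delete: doc_id appears only in the old snapshot
    - update: doc_id appears in both, but its text changed
    - read:   doc_id appears in both with identical text (persists unchanged)
    """
    old_ids, new_ids = set(old), set(new)
    shared = old_ids & new_ids
    return {
        "create": new_ids - old_ids,
        "delete": old_ids - new_ids,
        "update": {d for d in shared if old[d] != new[d]},
        "read": {d for d in shared if old[d] == new[d]},
    }

# Example: two tiny snapshots of a collection at times t0 and t1.
t0 = {"d0": "removed later", "d1": "old text", "d2": "stable text"}
t1 = {"d1": "revised text", "d2": "stable text", "d3": "new document"}
print(classify_crud(t0, t1))
# {'create': {'d3'}, 'delete': {'d0'}, 'update': {'d1'}, 'read': {'d2'}}
```

The same comparison could in principle be applied to the other test-collection components the abstract names (topics and relevance judgments), yielding the evaluation scenarios the paper derives from combinations of such changes.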
Submission Number: 42