Fairness in Information Retrieval

Aldo Lipani

2016 (modified: 11 Nov 2022)SIGIR 2016Readers: Everyone

Abstract: The offline evaluation of Information Retrieval (IR) systems is performed through the use of test collections. A test collection, in its essence, is composed of: a collection of documents, a set of topics and, a set of relevance assessments for each topic, derived from the collection of documents. Ideally, for each topic, all the documents of the test collection should be judged, but due to the dimensions of the collections of documents, and their exponential growth over the years, this practice soon became impractical. Therefore, early in IR history, this problem has been addressed through the use of the pooling method. The pooling method consists of optimizing the relevance assessment process by pooling the documents retrieved by different search engines following a particular pooling strategy. The most common one consists on pooling the top d documents of each run. The pool is constructed from systems taking part in a challenge for which the collection was made, at a specific point in time, after which the collection is generally frozen in terms of relevance judgments. This method leads to a bias called pool bias, which is the effect that documents that were not selected in the pool created from the original runs will never be considered relevant. Thereby, this bias affects the evaluation of a system that has not been part of the pool, with any IR evaluation measures, making the comparison with pooled systems unfair. IR measures have evolved over the years and become more and more complex and difficult to interpret. Witnessing a need in industry for measures that 'make sense', I focus on the problematics of the two fundamental IR evaluation measures, Precision at cut-off P@n and Recall at cut-off [email protected]$. There are two reasons to consider such 'simple' metrics: first, they are cornerstones for many other developed metrics and, second, they are easy to understand by all users. To the eyes of a practitioner, these two evaluation measures are interesting because they lead to more intuitive interpretations like, how much time people are reading useless documents (low precision), or how many relevant documents they are missing (low recall). But this last interpretation, due to the fact that recall is inversely proportional to the number of relevant documents per topic, is very difficult to be addressed if to be judged is just a portion of the collection of documents, as it is done when using the pooling method. To tackle this problem, another kind of evaluation has been developed, based on measuring how much an IR system makes documents accessible. Accessibility measures can be seen as a complementary evaluation to recall because they provide information on whether some relevant documents are not retrieved due to an unfairness in accessibility. The main goal of this Ph.D. is to increase the stability and reusability of existing test collections, when to be evaluated are systems in terms of precision, recall, and accessibility. The outcome will be: the development of a novel estimator to tackle the pool bias issue for P@n, and R@n, a comprehensive analysis of the effect of the estimator on varying pooling strategies, and finally, to support the evaluation of recall, an analytic approach to the evaluation of accessibility measures.

0 Replies