Query Variability and Experimental Consistency: A Concerning Case Study

Published: 07 Jun 2024, Last Modified: 07 Jun 2024, ICTIR 2024, CC BY 4.0
Keywords: Evaluation, significance testing
Abstract: In offline experimentation, the effectiveness of a search engine is evaluated using a document collection, a set of queries against that collection, a set of relevance judgments connecting the documents and the queries, and an effectiveness metric. This measurement pipeline is used as a surrogate for user satisfaction, that is, the extent to which the system provides useful information to the users issuing the queries. But queries are responses to information needs, or topics, and any given information need can be expressed as a query in a wide variety of ways. That one-to-many relationship suggests that, in an IR experiment, using any single query to represent a topic may be insufficient. In this case study, we demonstrate that this practice is indeed a weakness. We show that the TREC 2013 and 2014 Web track queries, which are regarded as being indicative of specific information needs, are not representative of crowd-generated queries for the same underlying needs, and can give rise to inconsistent system relativities when compared to user-generated queries. This instance raises a clear concern: current test collection design strategies can lead to effectiveness results that are at odds with those experienced by typical non-expert users.
Submission Number: 27
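
To make the concern concrete, here is a minimal Python sketch (not from the paper; the topics, query variants, and all effectiveness scores below are hypothetical) showing how evaluating each topic with a single canonical query can produce a different system ordering than averaging over user-generated query variants:

```python
# Minimal sketch: scoring each topic with one "canonical" query can rank
# systems differently than scoring over many query variants for the same
# underlying information need. All scores are hypothetical effectiveness
# values (e.g., nDCG@10), chosen only to illustrate the flip.

from statistics import mean

# topic -> query variants; the first entry plays the role of the
# canonical TREC-style query for that topic
variants = {
    "t1": ["raspberry pi", "what is a raspberry pi", "raspberry pi uses"],
    "t2": ["wedding budget", "average cost of a wedding", "wedding costs"],
}

# system -> query -> hypothetical effectiveness score
scores = {
    "sysA": {"raspberry pi": 0.80, "what is a raspberry pi": 0.40,
             "raspberry pi uses": 0.35, "wedding budget": 0.70,
             "average cost of a wedding": 0.30, "wedding costs": 0.25},
    "sysB": {"raspberry pi": 0.60, "what is a raspberry pi": 0.65,
             "raspberry pi uses": 0.55, "wedding budget": 0.50,
             "average cost of a wedding": 0.60, "wedding costs": 0.55},
}

def single_query_mean(sys):
    # evaluate using only the canonical query for each topic
    return mean(scores[sys][qs[0]] for qs in variants.values())

def variant_mean(sys):
    # average over all variants of each topic first, then over topics
    return mean(mean(scores[sys][q] for q in qs) for qs in variants.values())

for sys in scores:
    print(sys, round(single_query_mean(sys), 3), round(variant_mean(sys), 3))
# sysA 0.75 0.467
# sysB 0.55 0.575
# sysA wins on the canonical queries but loses once query variability
# is taken into account, i.e., the system relativities are inconsistent.
```

The per-topic averaging in variant_mean reflects one plausible way to aggregate over variability, with each topic weighted equally regardless of how many query variants it has; the point of the sketch is only that the two evaluation regimes need not agree on which system is better.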