Abstract: The diminishing return property of ERR (Expected Reciprocal Rank) is highly intuitive and attractive: its user model says, for example, that once users have found a highly relevant document at rank $r$, few of them will continue to examine rank $r+1$ and beyond. Recently, another IR evaluation measure based on diminishing return, called iRBU (intentwise Rank-Biased Utility), was proposed, and it was reported that nDCG (normalised Discounted Cumulative Gain) and iRBU align surprisingly well with users’ SERP (Search Engine Result Page) preferences. The present study conducts offline evaluations of diminishing return measures, including ERR and iRBU, along with other popular measures such as nDCG, using four test collections and the associated runs from recent TREC tracks and NTCIR tasks. Our results show that the diminishing return measures generally underperform other graded relevance measures, both in system ranking consistency across two disjoint topic sets and in discriminative power. The results generalise a previous finding on the limited discriminative power of ERR, showing that the diminishing return user model hurts the stability of evaluation measures regardless of the utility function part of the measure. Hence, while we do recommend iRBU along with nDCG for evaluating adhoc IR systems from multiple user-oriented angles, iRBU should be used with the awareness that it can be much less statistically stable than nDCG.
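The diminishing return user model described above can be illustrated with a minimal sketch of ERR. It assumes the commonly used formulation, in which a document with graded gain g satisfies the user with probability (2^g - 1) / 2^g_max; the abstract does not state which variant the study uses, and the function names here (`stop_probabilities`, `reach_probabilities`, `err`) are hypothetical, chosen only for illustration.

```python
def stop_probabilities(gains, max_gain):
    """Probability that the user is satisfied (and stops) at each rank.

    Assumption: the common mapping (2^g - 1) / 2^max_gain from graded
    relevance to satisfaction probability.
    """
    return [(2 ** g - 1) / (2 ** max_gain) for g in gains]


def reach_probabilities(gains, max_gain):
    """Probability that the user examines each rank at all."""
    probs = []
    p_continue = 1.0  # everyone examines rank 1
    for p_stop in stop_probabilities(gains, max_gain):
        probs.append(p_continue)
        p_continue *= 1.0 - p_stop  # only unsatisfied users move on
    return probs


def err(gains, max_gain):
    """Expected Reciprocal Rank of a ranked list of graded gains."""
    score = 0.0
    p_continue = 1.0
    for rank, p_stop in enumerate(stop_probabilities(gains, max_gain), start=1):
        # Utility 1/rank, weighted by the chance of reaching and stopping here.
        score += p_continue * p_stop / rank
        p_continue *= 1.0 - p_stop
    return score


if __name__ == "__main__":
    # A highly relevant document (gain 3 of 3) at rank 1: only 12.5% of
    # users are modelled as examining rank 2 and beyond.
    print(reach_probabilities([3, 3, 3], max_gain=3))
    print(err([3, 3], max_gain=3), err([0, 3], max_gain=3))
```

Under these assumptions, a second highly relevant document directly below the first adds very little to the score, which is the diminishing return behaviour the abstract refers to.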