Ranking evaluation metrics from a group-theoretic perspective

TMLR Paper 2157 Authors

08 Feb 2024 (modified: 02 Jul 2024) | Under review for TMLR
Abstract: Identifying the most suitable metric to validate the merits of a newly proposed model is far from straightforward: comparing rankings poses its own formidable challenges, and no universal metric is likely to apply to all scenarios. Moreover, metrics designed for a specific context, such as Recommender Systems, are sometimes adopted in other domains without a thorough understanding of their underlying mechanisms, leading to unforeseen outcomes and potential misuse. Complicating matters further, distinct metrics may emphasize different aspects of a ranking, frequently producing seemingly contradictory comparisons of model results and undermining the trustworthiness of evaluations. We unveil these aspects in the domain of ranking evaluation metrics. First, we exhibit instances of inconsistent evaluations, a source of potential mistrust in commonly used metrics, and, by quantifying the frequency of such disagreements, we show that they are common in rankings. We then conceptualize rankings through the mathematical formalism of symmetric groups, detaching them from the domains in which the metrics were originally devised; this abstraction allows us to rigorously and formally establish essential mathematical properties of ranking evaluation metrics, which are key to understanding the source of inconsistent evaluations. We conclude with a discussion connecting our theoretical analysis to practical applications, highlighting which properties are important in each domain where rankings are commonly evaluated. Overall, our analysis sheds light on ranking evaluation metrics, showing that inconsistent evaluations should not be seen as a source of mistrust but as a call to choose carefully how we evaluate our models.
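To make the abstract's disagreement claim concrete, here is a minimal sketch (an invented illustration, not an example taken from the paper): two common ranking metrics, MRR and Precision@k, evaluate the same pair of rankings and prefer opposite ones. The relevance vectors and function names are hypothetical, chosen only for the demonstration.

```python
def mrr(rels):
    """Reciprocal rank of the first relevant item (0 if none is relevant)."""
    for position, rel in enumerate(rels, start=1):
        if rel:
            return 1.0 / position
    return 0.0

def precision_at_k(rels, k):
    """Fraction of relevant items among the top-k ranked positions."""
    return sum(rels[:k]) / k

# Binary relevance of items, listed in the order each model ranks them.
# These rankings are made up for illustration.
ranking_a = [1, 0, 0, 1, 1]
ranking_b = [0, 1, 1, 1, 0]

print("MRR:  A =", mrr(ranking_a), " B =", mrr(ranking_b))   # A = 1.0,  B = 0.5
print("P@3:  A =", precision_at_k(ranking_a, 3),
      " B =", precision_at_k(ranking_b, 3))                   # A ≈ 0.33, B ≈ 0.67
```

Under MRR, model A looks strictly better; under Precision@3, model B does. Which verdict to trust depends on which property of the ranking the application actually cares about, which is precisely the kind of question the paper's group-theoretic analysis of metric properties is meant to answer.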
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=5ibNhaomZT
Changes Since Last Submission: We considered the reviewers' comments and rewrote the text for readability and clarity. In particular:
- We made clear what the problem is and what our contribution is. The reviewers noted that it was unclear why transferring the problem of ranking evaluation metrics to such a general mathematical structure matters, in which context each of the mathematical properties is essential, and why; we addressed all these questions in the introduction, the abstract, and the main part of the manuscript. Furthermore, the first version of the previous submission lacked some definitions of the considered metrics, and some proofs were only given in abbreviated form; we now provide an extended Appendix containing all definitions and full proofs of our claims.
- We summarize the findings in two tables: one indicating whether each metric satisfies the properties, and one summarizing the meaning of each property and the context in which it is essential. Table 3 is derived from the previous version but is more concise.
- We rigorously defined how to state whether a metric is robust.
- We polished the complete manuscript, trimming the text to our contribution and the community's interest. In particular, we noticed that some parts of the previous discussion were not well contextualized and not essential to the scope of the manuscript. We made the definitions clearer and fixed the mentioned inconsistencies.
- We included a comparison with previous literature, keeping the reviewers' suggestions in mind, and mentioned in which other contexts the same or similar properties have been introduced.
- We added a discussion and, for each context in which rankings are compared, mentioned which of the proposed properties are fundamental.
Assigned Action Editor: ~Jaakko_Peltonen1
Submission Number: 2157