A Gold Standard Dataset for the Reviewer Assignment Problem

TMLR Paper1967 Authors

20 Dec 2023 (modified: 17 Sept 2024). Rejected by TMLR. License: CC BY 4.0

Abstract: Many peer-review venues are either using or looking to use algorithms to assign submissions to reviewers. The crux of such automated approaches is the notion of the "similarity score" (a numerical estimate of the expertise of a reviewer in reviewing a paper), and many algorithms have been proposed to compute these scores. However, these algorithms have not been subjected to a principled comparison, making it difficult for stakeholders to choose an algorithm in an evidence-based manner. The key challenge in comparing existing algorithms and developing better ones is the lack of publicly available gold-standard data needed to perform reproducible research. We address this challenge by collecting a novel dataset of similarity scores that we release to the research community. Our dataset consists of 477 self-reported expertise scores provided by 58 researchers who evaluated their expertise in reviewing papers they have read previously. We use this data to compare several popular algorithms currently employed in computer science conferences and come up with recommendations for stakeholders. Our three main findings are:

- All algorithms make a non-trivial amount of error. For the task of ordering two papers in terms of their relevance for a reviewer, the error rates range from 12%-30% in easy cases to 36%-43% in hard cases, thereby highlighting the vital need for more research on the similarity-computation problem.
- Most existing algorithms are designed to work with titles and abstracts of papers, and in this regime the Specter+MFR algorithm performs best.
- To improve performance, it may be important to develop modern deep-learning-based algorithms that can make use of the full texts of papers: the classical TF-IDF algorithm enhanced with full texts of papers is on par with the deep-learning-based Specter+MFR, which cannot make use of this information.

We encourage researchers to use this dataset for evaluating and developing better similarity-computation algorithms.
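
The pairwise ordering task mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's evaluation code: the function name `pairwise_error_rate`, the dictionary layout, the unweighted counting, and the toy scores are all assumptions made for illustration. For one reviewer, it counts how often a similarity algorithm orders a pair of papers differently from the reviewer's self-reported expertise.

```python
from itertools import combinations


def pairwise_error_rate(expertise, similarity):
    """Fraction of paper pairs that the predicted similarity orders incorrectly.

    expertise:  paper id -> self-reported expertise of one reviewer (hypothetical layout)
    similarity: paper id -> similarity score predicted by some algorithm
    Pairs with equal self-reported expertise are skipped, since they have no
    correct ordering; a tied prediction counts as half an error.
    """
    errors = 0.0
    total = 0
    for a, b in combinations(expertise, 2):
        gap = expertise[a] - expertise[b]
        if gap == 0:
            continue  # no ground-truth ordering for this pair
        total += 1
        pred = similarity[a] - similarity[b]
        if pred == 0:
            errors += 0.5  # the algorithm cannot distinguish the two papers
        elif (pred > 0) != (gap > 0):
            errors += 1.0  # the algorithm ranks the pair in the wrong order
    return errors / total if total else 0.0


# Toy data for a single reviewer (made up, not taken from the dataset).
expertise = {"p1": 5.0, "p2": 3.0, "p3": 1.0}
similarity = {"p1": 0.9, "p2": 0.2, "p3": 0.4}
rate = pairwise_error_rate(expertise, similarity)
print(f"pairwise error rate: {rate:.2f}")
```

In this toy case only the pair (p2, p3) is misordered, so one of three pairs is wrong. An actual evaluation might weight each pair by the size of the expertise gap or separate easy pairs from hard ones; this sketch does neither.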

Submission Length: Long submission (more than 12 pages of main content)

Changes Since Last Submission:

- Highlighted the limitation of the lack of diversity of participants in the introduction.
- Slightly recomputed the numbers based on the reviewers' suggestion of excluding 6 papers in one of the analyses (this leads to very little change in the outcomes).
- Added more detail of the survey text to the main text.
- Changed Figure 1a to include the entire histogram, and updated its associated discussion to make it consistent with other parts of the paper.

Assigned Action Editor: ~Jaakko_Peltonen1

Submission Number: 1967
