A Gold Standard Dataset for the Reviewer Assignment Problem

Ivan Stelmakh; John Frederick Wieting; Yang Xi; Graham Neubig; Nihar B Shah

A Gold Standard Dataset for the Reviewer Assignment Problem

Ivan Stelmakh, John Frederick Wieting, Yang Xi, Graham Neubig, Nihar B Shah

Published: 30 Apr 2025, Last Modified: 02 May 2025Accepted by TMLREveryoneRevisionsBibTeXCC BY 4.0

Abstract: Many peer-review venues are using algorithms to assign submissions to reviewers. The crux of such automated approaches is the notion of the “similarity score’’ — a numerical estimate of the expertise of a reviewer in reviewing a paper — and many algorithms have been proposed to compute these scores. However, these algorithms have not been subjected to a principled comparison, making it difficult for stakeholders to choose the algorithm in an evidence-based manner. The key challenge in comparing existing algorithms and developing better algorithms is the lack of the publicly available gold-standard data that would be needed to perform reproducible research. We address this challenge by collecting a novel dataset of similarity scores that we release to the research community. Our dataset consists of 477 self-reported expertise scores provided by 58 researchers who evaluated their expertise in reviewing papers they have read previously. We use this data to compare several popular algorithms currently employed in computer science conferences and come up with recommendations for stakeholders. Our four main findings are: - All algorithms make a non-trivial amount of error. For the task of ordering two papers in terms of their relevance for a reviewer, the error rates range from 12%-30% in easy cases to 36%-43% in hard cases, thereby highlighting the vital need for more research on the similarity-computation problem. - Most specialized algorithms are designed to work with titles and abstracts of papers, and in this regime the Specter2 algorithm performs best. - The classical TF-IDF algorithm which can use full texts of papers is on par with Specter2 that uses only titles and abstracts. - The performance of off-the-shelf LLMs is worse than the specialized algorithms. We encourage researchers to participate in our survey and contribute their data to the dataset here: https://forms.gle/SP1Rh8eivGz54xR37

Submission Length: Long submission (more than 12 pages of main content)

Previous TMLR Submission Url: https://openreview.net/forum?id=gEjE89BhcI

Changes Since Last Submission: This is a resubmission of TMLR submission 1967. - There are valid limitations that reviewers pointed out, and in the revision, we have clearly highlighted these limitations including in the introduction. - Various technical comments from the reviewers are addressed. - Several comments were unrelated to the claims in the paper. In our previous submission, the AE cited these comments as reasons for rejection. Based on TMLR's review criteria, we believe these comments are irrelevant to the decision-making process. As one example, a prominent AE comment is that "Lack of a new algorithmic solution was criticized" whereas we never claimed to make algorithmic contributions. We had requested a better understanding from the AE on how these comments relate to our paper's claims and TMLR's criteria but did not obtain any clarification. We hope that the resubmission will be evaluated according to TMLR's criteria. ---------------- We now provide more details of the reviewer comments from the previous submission and our responses and changes. > Limitations of sample size and diversity (particularly large fraction of population from the US) We have acknowledged these limitations in the introduction of the paper. Please see the "Key limitations" heading in the introduction. We reiterate these limitations in the discussion section at the end of the paper. > Request for using a ordered list. We had done that in the initial submission: In our experiments, we have used rankings as a measure of accuracy. Specifically, in Section 7, conditioned on whether we are in the easy or hard triple setting, the loss function is the 0-1 loss on whether the algorithm predicts the relative ranking of a pair of papers correct or not. > Request for conducting another survey to validate this survey's interfaces, or use publication/citation counts. Regarding the first comment, we believe this is beyond the scope of our paper, based on the established research and publication protocols in this field. Most peer review studies involving human subjects have dependencies on the interfaces used -- instructions, scoring levels, and level titles all influence reviewer scores. However, there seldom exist any papers with multiple studies involving different interfaces. Nonetheless, we have acknowledged this limitation in our submission: "the design of the survey interface can influence the data and, consequently, the evaluations. We hope for further studies on reviewer expertise, potentially employing diverse methodologies. When combined with our dataset, these additional studies could provide a more robust and multi-dimensional evaluation." Finally, regarding the latter comment on using publication/citation counts (reviewer uEkS), we don't understand what the reviewer is talking about. > Mention more details about survey instructions We did that in the revision we had submitted: we included the text provided to the participants containing an example of what we meant by tricky examples. This is in the "expertise evaluations" part of Section 3 in the revision. > Scope Limitation to Computer Science In the introduction, we state "Specifically, we conduct a survey of computer science researchers" to ensure our claims are about computer science. We also removed the following statement from the paper about other disciplines "other communities may also use it to evaluate existing or develop new similarity-computation algorithms. To evaluate an algorithm from another domain on our data, researchers can fine-tune their algorithm on profiles of computer science scientists crawled from Semantic Scholar and then evaluate it on our dataset." > "Depending on the section, the experimental results use 2 slightly different versions of the dataset"...."low expertise is defined [slightly differently in two sections]"..."would be good to see if using 1 standard deviation provides any additional (albeit lower confidence) conclusions" Regarding the first two comments on consistency: we had fixed these issues. For the latter we responded as a publicly visible response on openreview with all the numbers (https://openreview.net/forum?id=gEjE89BhcI&noteId=W93zrQmrGK). We did not include it in the paper as it did not seem to yield additional insights. > Reviewer uEkS' requested changes 2 and 3 (privacy, reviewer's speed of response, distribution of review workloads, scaling, encryption etc.) These are well beyond the claims of this paper. We only claim to evaluate the accuracy of the algorithms with respect to the self-reports of the researchers who participated in the study. > Lack of algorithmic contributions / Use of larger pretrained models finetuned on scientific domains was desired. We make no claims of new algorithms. Our claims pertain to the new dataset and evaluating existing algorithms.

Code: https://github.com/niharshah/goldstandard-reviewer-paper-match

Assigned Action Editor: ~Brian_Kulis1

Submission Number: 3019

Loading