Counterfactual Evaluation of Peer-Review Assignment Policies

Martin Saveski; Steven Jecmen; Nihar B Shah; Johan Ugander

Counterfactual Evaluation of Peer-Review Assignment Policies

Martin Saveski, Steven Jecmen, Nihar B Shah, Johan Ugander

Published: 21 Sept 2023, Last Modified: 02 Nov 2023NeurIPS 2023 spotlightEveryoneRevisionsBibTeX

Keywords: peer review, causal inference, counterfactual policy evaluation

TL;DR: We propose methods for off-policy evaluation in peer review and apply them to data from the TPDP’21 workshop and the AAAI’22 conference.

Abstract: Peer review assignment algorithms aim to match research papers to suitable expert reviewers, working to maximize the quality of the resulting reviews. A key challenge in designing effective assignment policies is evaluating how changes to the assignment algorithm map to changes in review quality. In this work, we leverage recently proposed policies that introduce randomness in peer-review assignment—in order to mitigate fraud—as a valuable opportunity to evaluate counterfactual assignment policies. Specifically, we exploit how such randomized assignments provide a positive probability of observing the reviews of many assignment policies of interest. To address challenges in applying standard off-policy evaluation methods, such as violations of positivity, we introduce novel methods for partial identification based on monotonicity and Lipschitz smoothness assumptions for the mapping between reviewer-paper covariates and outcomes. We apply our methods to peer-review data from two computer science venues: the TPDP'21 workshop (95 papers and 35 reviewers) and the AAAI'22 conference (8,450 papers and 3,145 reviewers). We consider estimates of (i) the effect on review quality when changing weights in the assignment algorithm, e.g., weighting reviewers' bids vs. textual similarity (between the review's past papers and the submission), and (ii) the "cost of randomization", capturing the difference in expected quality between the perturbed and unperturbed optimal match. We find that placing higher weight on text similarity results in higher review quality and that introducing randomization in the reviewer-paper assignment only marginally reduces the review quality. Our methods for partial identification may be of independent interest, while our off-policy approach can likely find use in evaluating a broad class of algorithmic matching systems.

Supplementary Material: pdf

Submission Number: 10830

Loading