Off-Policy Evaluation for Ranking Policies under Deterministic Logging Policies

Published: 26 Jan 2026, Last Modified: 11 Apr 2026 | ICLR 2026 Poster | Everyone | CC BY 4.0
Keywords: off-policy evaluation; ranking; common support; deterministic logging
TL;DR: We propose novel OPE estimators that work even when your logs come from a fully deterministic ranker by using user click randomness instead of policy randomness.
Abstract: Off-Policy Evaluation (OPE) is an important practical problem in algorithmic ranking systems, where the goal is to estimate the expected performance of a new ranking policy using only offline logged data collected under a different logging policy. Existing estimators, such as the ranking-wise and position-wise inverse propensity score (IPS) estimators, require the data collection policy to be sufficiently stochastic and suffer from severe bias when the logging policy is deterministic. In this paper, we propose novel estimators, Click-based Inverse Propensity Score (CIPS) and Click-based Doubly Robust (CDR), which exploit the intrinsic stochasticity of user click behavior to address this challenge. Unlike existing methods that rely on the stochasticity of the logging policy, our approach uses click probability as a new form of importance weighting, enabling low-bias OPE even under deterministic logging policies where existing methods incur substantial bias. We provide theoretical analyses of the bias and variance properties of the proposed estimators and show, through synthetic and real-world experiments, that our estimators achieve significantly lower bias than strong baselines, particularly in settings with completely deterministic logging policies.
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 2960
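
Illustration: the abstract's claim that ranking-wise IPS is biased under a deterministic logging policy can be checked on a toy problem. The sketch below is not the paper's CIPS or CDR estimator; it only computes the exact expectation of the standard ranking-wise IPS estimate for one context with two candidate rankings. All numbers and function names (`true_value`, `expected_ips_estimate`) are hypothetical and chosen purely for illustration.

```python
def true_value(pi_target, q):
    """Expected reward of the target policy: sum_a pi(a) * q(a)."""
    return sum(p * r for p, r in zip(pi_target, q))


def expected_ips_estimate(pi_target, pi_logging, q):
    """Exact expectation of the ranking-wise IPS estimate under the logging policy.

    Only rankings with pi_logging(a) > 0 can ever appear in the logs, so only
    those contribute: sum_{a: pi0(a) > 0} pi0(a) * (pi(a) / pi0(a)) * q(a).
    """
    total = 0.0
    for p, p0, r in zip(pi_target, pi_logging, q):
        if p0 > 0:
            total += p0 * (p / p0) * r
    return total


# Hypothetical toy setup: two rankings with expected rewards q.
q = [0.2, 0.6]
pi_target = [0.5, 0.5]        # policy to be evaluated
pi_deterministic = [1.0, 0.0]  # always logs the first ranking
pi_stochastic = [0.7, 0.3]     # gives every ranking nonzero probability

v = true_value(pi_target, q)
bias_det = expected_ips_estimate(pi_target, pi_deterministic, q) - v
bias_stoch = expected_ips_estimate(pi_target, pi_stochastic, q) - v

print(f"true value:                 {v:.3f}")
print(f"IPS bias, deterministic log: {bias_det:+.3f}")
print(f"IPS bias, stochastic log:    {bias_stoch:+.3f}")
```

In this toy case the deterministic logger never shows the second ranking, so its contribution to the target policy's value is missing from the IPS expectation and the estimate is biased downward by exactly that amount. With the stochastic logger, every ranking has support and the bias vanishes up to floating-point error.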