Safe Deployment for Counterfactual Learning to Rank with Exposure-Based Risk Minimization
Abstract: Counterfactual learning to rank (CLTR) relies on exposure-based
inverse propensity scoring (IPS), an LTR-specific adaptation of IPS
to correct for position bias. While IPS can provide unbiased and
consistent estimates, it often suffers from high variance. Especially
when little click data is available, this variance can cause CLTR to
learn sub-optimal ranking behavior. Consequently, existing CLTR methods carry significant risks: naively deploying their models can result in very negative user experiences.
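To make the variance issue concrete, consider a sketch of an exposure-based IPS estimator in the style common to this line of work; the notation below (expected exposure $\rho$, logging policy $\pi_0$, click indicators $c_i$) is illustrative and not necessarily the paper's exact formulation:
\[
\hat{R}(\pi) \;=\; \frac{1}{N} \sum_{i=1}^{N} \sum_{d \in D_{q_i}} \frac{\rho(d \mid q_i, \pi)}{\rho(d \mid q_i, \pi_0)} \, c_i(d),
\]
where $\rho(d \mid q, \pi)$ is the expected exposure of document $d$ under ranking policy $\pi$. When the logging policy $\pi_0$ gives a clicked document little exposure, the propensity $\rho(d \mid q_i, \pi_0)$ is small and the corresponding weight is large; with few logged interactions, such large weights are exactly what drives the high variance described above.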
We introduce a novel risk-aware CLTR method with theoretical guarantees for safe deployment. We apply an exposure-based concept of risk regularization to IPS estimation for LTR. Our risk regularization penalizes the mismatch between the ranking behavior of a learned model and a given safe model. This ensures that the learned ranking model stays close to a trusted model when there is high uncertainty in IPS estimation, which greatly reduces the risks during deployment.
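As a rough sketch of how such a regularizer can enter the learning objective, a CRM-style formulation (in the spirit of counterfactual risk minimization) would subtract a penalty that grows with the divergence between the exposure distributions of the learned policy and the safe policy; the divergence $d(\cdot \| \cdot)$, trade-off parameter $\lambda$, and exact form below are illustrative assumptions, not the paper's precise objective:
\[
\pi^{*} \;=\; \arg\max_{\pi} \; \hat{R}(\pi) \;-\; \lambda \sqrt{\frac{d\big(\rho(\cdot \mid \pi) \,\big\|\, \rho(\cdot \mid \pi_{\text{safe}})\big)}{N}}.
\]
Under this kind of objective, when little data is available (small $N$) the penalty dominates and keeps $\pi^{*}$ near the safe policy, while as $N$ grows the penalty vanishes and the IPS estimate's own optimum takes over, matching the safe-deployment behavior described above.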
Our experimental results demonstrate the efficacy of the proposed method: it avoids the initial periods of poor performance that occur when little click data is available, while maintaining high performance at convergence. For the CLTR field, our exposure-based risk minimization method enables practitioners to adopt CLTR methods in a safer manner, mitigating many of the risks attached to previous methods.