Keywords: offline reinforcement learning, offline black-box optimization, Bayesian optimization, contextual bandits, off-policy evaluation, identification, observational equivalence, position paper
TL;DR: An offline-evaluation score gap is observationally equivalent under three explanations: distribution shift, dataset coverage, and genuine policy improvement. We propose a four-item identification standard for offline-to-online transfer.
Abstract: Offline reinforcement learning, offline black-box optimization, contextual bandits, and Bayesian optimization share a common evaluation challenge: a method's reported performance on a logged offline dataset is treated as evidence of how the method will perform when deployed online. We argue that this offline-to-online inference is not identified, because an offline score is observationally equivalent under three distinct mechanisms that current evaluation cannot separate: a distribution-shift artifact, a dataset-coverage artifact, and genuine policy improvement. The same offline-evaluation score can be rationalized by any combination of these three mechanisms, so a high offline score alone cannot identify which mechanism produced it, and hence whether the gain will carry over to online deployment. We formalize this as a non-identification result; conduct a cross-domain audit of the principal offline-to-online literature, documenting where the identification gap is largest in each subfield; propose a four-item identification standard that offline-to-online papers should disclose; and argue that the cross-domain perspective sharpens the standard by showing how the same identification problem manifests differently across offline RL, offline BBO, off-policy evaluation, and Bayesian optimization. The standard is cheap to apply, draws on quantities offline papers already collect, and gives reviewers a common vocabulary for evaluating online-deployment claims.
Submission Number: 152