Keywords: offline reinforcement learning, off-policy evaluation, partial identification, support holes, Bellman–Whitney envelopes, Lipschitz smoothness, uncertainty quantification
TL;DR: Under genuine support holes, offline policy values are not point-identified; we characterize the exact Bellman-consistent value interval under Lipschitz smoothness and derive sharp uncertainty and action certificates.
Abstract: We study finite-horizon offline evaluation and control when a target policy enters state--action regions with zero behavior support, so the target value is not point-identified. We introduce a Bellman--Lipschitz compatibility class that constrains candidate $Q$-sequences only through Bellman equalities on the observed support and Lipschitz extensions off support. Under a rectangular Bellman--Lipschitz closure condition, we prove that the exact identified interval of the target-policy value is given by a backward Bellman--Whitney recursion, and that this recursion recovers the sharp smooth no-overlap interval exactly when $H=1$. We further show that the same endpoints admit a no-gap dual characterization via one-sided Bellman relaxations, and we identify a dynamic support-hole geometry for the interval width that is sharp on explicit least-favorable sequential families. On the statistical side, we prove deterministic stability of the recursive endpoints under joint perturbations of the support sets and supported Bellman operators, derive stagewise additive finite-sample endpoint-estimation bounds, and establish an oracle minimax lower bound on a favorable zero-width subclass. Finally, under the control analogue of our closure assumption, we derive Bellman--Whitney action certificates that partition actions into certifiably good, certifiably bad, and intrinsically ambiguous sets.
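As a rough illustration of the $H=1$ case only, the sketch below computes the classical McShane-Whitney Lipschitz extension bounds at a query point, given values known on a finite support. It is not the paper's recursion, which the abstract does not spell out; the scalar state space, the absolute-value metric, and the names `whitney_envelopes`, `support`, `values`, and `lipschitz` are all assumptions made for the example.

```python
from typing import Sequence, Tuple


def whitney_envelopes(
    x: float,
    support: Sequence[float],
    values: Sequence[float],
    lipschitz: float,
) -> Tuple[float, float]:
    """Return (lower, upper) bounds on f(x) over all L-Lipschitz f
    that agree with `values` on `support`.

    Hypothetical helper: scalar inputs and the |.| metric are assumptions.
    """
    pairs = list(zip(support, values))
    # Upper envelope: the tightest cap implied by any supported point.
    upper = min(q + lipschitz * abs(x - s) for s, q in pairs)
    # Lower envelope: the tightest floor implied by any supported point.
    lower = max(q - lipschitz * abs(x - s) for s, q in pairs)
    return lower, upper


# Two supported points with known values, Lipschitz constant 1.
support = [0.0, 1.0]
values = [0.0, 0.5]

# Inside the support hole the value is only interval-identified.
print(whitney_envelopes(0.5, support, values, 1.0))
# On the support the interval collapses to the observed value.
print(whitney_envelopes(0.0, support, values, 1.0))
```

In this toy setting the interval width grows with the distance from the query point to the nearest supported points, which is the one-step analogue of the support-hole geometry the abstract refers to.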
Submission Number: 24