Keywords: off-policy evaluation, reinforcement learning, coverage
Abstract: Off-policy evaluation (OPE) is a fundamental task in reinforcement learning (RL). In the classic setting of \emph{linear OPE}, finite-sample guarantees often take the form
$$
\textrm{Prediction error} \le \textrm{poly}(C^\pi, d, 1/n, \log(1/\delta)),
$$
where $d$ is the dimension of the features, $n$ is the number of samples, and $C^\pi$ is a \emph{feature coverage parameter} that characterizes the degree to which the visited features lie in the span of the data distribution. Although such guarantees are well understood for several popular algorithms under the Bellman-completeness assumption, guarantees of this form have not yet been achieved in the minimal setting where it is only assumed that the target value function is linearly realizable in the features. Despite recent interest in tight characterizations of this setting, the right notion of coverage remains unclear, and candidate definitions from prior analyses have undesirable properties and are starkly disconnected from more standard quantities in the literature.
In this paper, we provide a novel finite-sample analysis of a canonical algorithm for this setting, LSTDQ. Inspired by an instrumental-variable (IV) view, we develop error bounds that depend on a new coverage parameter, the \emph{feature-dynamics coverage}, which can be interpreted as feature coverage in a linear dynamical system. Under further assumptions, such as Bellman-completeness, our definition recovers the coverage parameters specialized to those settings, providing a unified understanding of coverage in linear OPE.
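For concreteness, a standard way to write the LSTDQ estimate of the weights of a linear Q-function $Q^\pi(s,a) \approx \phi(s,a)^\top w$ (shown here with illustrative notation, for a deterministic target policy $\pi$, discount factor $\gamma$, and transitions $(s_i, a_i, r_i, s_i')$ drawn from the data distribution) is
$$
\hat{w} \;=\; \Big(\tfrac{1}{n}\textstyle\sum_{i=1}^{n} \phi(s_i,a_i)\,\big(\phi(s_i,a_i) - \gamma\,\phi(s_i',\pi(s_i'))\big)^{\top}\Big)^{-1} \tfrac{1}{n}\textstyle\sum_{i=1}^{n} \phi(s_i,a_i)\, r_i,
$$
in which $\phi(s_i,a_i)$ acts as an instrument for the regressor $\phi(s_i,a_i) - \gamma\,\phi(s_i',\pi(s_i'))$, reflecting the IV view described above.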
Supplementary Material: pdf
Primary Area: learning theory
Submission Number: 14460