\documentclass[accepted]{uai2022} % for initial submission
% \documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \bibpunct{(}{)}{;}{a}{,}{,}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
%\usepackage{tikz} % nice language for creating drawings and diagrams


\usepackage{textcase}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\usepackage{amsmath, amsthm, amssymb, amsfonts, mathtools, graphicx, enumitem}
\usepackage{algorithm,algorithmic}
\renewcommand{\algorithmicrequire}{\textbf{Input:}}
\renewcommand{\algorithmicensure}{\textbf{Output:}}
\usepackage{mkolar_definitions}

\newtheorem*{theorem*}{Theorem}

\usepackage{comment}



%%%% Drawing
\usepackage{tikz}
\usepackage{bbm}
\usetikzlibrary{automata, arrows}
\usetikzlibrary{positioning}



\usepackage{xr}
\makeatletter
\newcommand*{\addFileDependency}[1]{% argument=file name and extension
	\typeout{(#1)}
	\@addtofilelist{#1}
	\IfFileExists{#1}{}{\typeout{No file #1.}}
}
\makeatother

\newcommand*{\myexternaldocument}[1]{%
	\externaldocument{#1}%
	\addFileDependency{#1.tex}%
	\addFileDependency{#1.aux}%
}
\myexternaldocument{chen_628-supp}


%%%% cref, Cref
\usepackage[capitalise,nameinlink]{cleveref}
\Crefname{equation}{Eq.}{Eqs.}
\Crefname{assumption}{Assumption}{Assumptions}
\Crefname{condition}{Condition}{Conditions}


%%% Allow math equation to cross pages
\allowdisplaybreaks


\newcommand{\defeq}{:=}
%\newcommand{\gapmin}{\mathrm{gap}_{\mathrm{min}}}
\newcommand{\gapmin}{C_\mathrm{gap}}
\newcommand{\gap}{\mathrm{gap}}
\newcommand{\cgap}{C_\mathrm{gap}}
\newcommand{\gapq}{\mathrm{gap}(Q^*)}
\newcommand{\estat}{\varepsilon_{\mathrm{stat},n}}

%\newcommand{\mainalg}{\text{AlgName}\xspace}
%\newcommand{\algunknown}{\text{AlgNameUnknown}\xspace}


\hypersetup{
  colorlinks   = true, %Colours links instead of ugly boxes
  urlcolor     = blue, %Colour for external hyperlinks
  linkcolor    = blue, %Colour of internal links
  citecolor   = blue  %Colour of citations
}

\newcount\Comments  % 0 suppresses notes to selves in text
\Comments=0 % TODO: change to 0 for final version
\definecolor{darkred}{rgb}{0.7,0,0}
\definecolor{darkgreen}{rgb}{0,0.5,0}
\definecolor{orange}{rgb}{0.7,0.4,0}
\definecolor{purple}{rgb}{0.8,0.0,0.8}
%\newcommand{\kibitz}[2]{\ifnum\Comments=1{\textcolor{#1}{\textsf{\footnotesize #2}}}\fi}
\newcommand{\jc}[1]{\textcolor{orange}{[JC: #1]}}
\newcommand{\nj}[1]{\textcolor{red}{[NJ: #1]}}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\title{Offline Reinforcement Learning Under Value and \\ Density-Ratio Realizability: The Power of Gaps}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author{Jinglin Chen, Nan Jiang}
%\author{Jinglin Chen}
%\author[1]{Nan Jiang\thanks{jinglinc@illinois.edu, nanjiang@illinois.edu}}
% Add affiliations after the authors
\affil{%
    Department of Computer Science\\
    University of Illinois Urbana-Champaign\\
    Urbana, IL, USA
    
    %\textrm{jinglinc@illinois.edu, nanjiang@illinois.edu}
}
%\affil[2]{%
%    Second Affiliation\\
%    Address\\
%    …
%}
%\affil[3]{%
%    Another Affiliation\\
%    Address\\
%    …
%  }
  
\begin{document}
\maketitle

\begin{abstract}
We consider a challenging theoretical problem in offline reinforcement learning (RL): obtaining sample-efficiency guarantees with a dataset lacking sufficient coverage, under only realizability-type assumptions for the function approximators. While the existing theory has addressed learning under realizability and under non-exploratory data separately, no work has been able to address both simultaneously (except for a concurrent work, to which we compare in detail). Under an additional gap assumption, we provide guarantees for a simple pessimistic algorithm based on a version space formed by marginalized importance sampling (MIS), and the guarantee only requires the data to cover the optimal policy and the function classes to realize the optimal value and density-ratio functions. While similar gap assumptions have been used in other areas of RL theory, our work is the first to identify the utility and the novel mechanism of gap assumptions in offline RL with weak function approximation.
\end{abstract}


\section{Introduction and Related Works}

%\blfootnote{jinglinc@illinois.edu, nanjiang@illinois.edu}

In offline reinforcement learning (RL), the learner searches for a good policy purely from historical (or offline) data, without direct interactions with the real environment. The absence of intervention in the system makes offline RL a promising paradigm for learning sequential decision-making strategies in many important real-world applications.

Early research in offline RL focused on analyzing approximate value and policy iteration algorithms and had significant overlap with the approximate dynamic programming literature \citep{munos2003error, munos2007performance, munos2008finite, antos2008learning,farahmand2010error}. These algorithms and their guarantees typically require relatively strong assumptions on both the expressivity of the function class and the exploratoriness of the dataset. For example, the analyses of Fitted Q-Iteration \citep{ernst2005tree,antos2007fitted} require the function class to be \textit{closed} under Bellman updates (also known as Bellman-completeness), and the offline data distribution to provide coverage (in some technical sense) over \textit{all} candidate policies \citep{chen2019information}. The former requirement is \textit{non-monotone} in the function class, which shatters the standard machine-learning intuition that a richer function class should always have better (or at least no worse) approximation power; it is also closely related to the instability of RL training and the infamous ``deadly triad'' \citep{sutton2018reinforcement, wang2021instabilities}. The latter requirement is very likely violated in practice since we have no control over how the historical data is collected 
\citep{fujimoto2019off}. 

Given these considerations, it is desirable to come up with novel algorithms and/or analyses to relax these assumptions. In particular, the ideal assumption on the function class is \textit{realizability}, that there is a target function of interest (such as the optimal value function) and we only require the function class to (approximately) capture such a function. The ideal assumption on the data distribution is \textit{single-policy} coverage, that it is ok for the data to not cover all policies, as long as an optimal (or sufficiently good) policy is covered. 

In recent years, significant progress has been made towards providing provable guarantees in offline RL under these relaxed assumptions. In particular, the principle of pessimism in the face of uncertainty proves to be useful in designing algorithms that work under single-policy coverage \citep{liu2020provably, jin2021pessimism, xie2021bellman,yin2021near,rashidinejad2021bridging}, but most of the existing pessimistic algorithms require Bellman completeness on the function class. On the other hand, relaxing Bellman completeness to realizability has been difficult: there is merely one existing result that requires only the realizability of the optimal value function \citep{xie2020batch}, yet their data assumption is even stronger than all-policy coverage. In fact, a recent information-theoretic lower bound by \citet{foster2021offline} confirms that even with a strong notion of all-policy coverage called (all-policy) concentrability, plus the realizability of value functions for \textit{all} policies, offline RL is still fundamentally intractable. 

Despite the lower bound, not all hope is lost. A promising way of breaking the lower bound is to assume the realizability of other functions beyond value functions. %, which is exactly the case in the literature on marginalized importance sampling (MIS). 
Indeed, positive results analogous to what we want have been established in off-policy evaluation (OPE), where the goal is to estimate the performance of a target policy from offline data, when additional realizability of \textit{density-ratio} functions is assumed. In particular, \citet{liu2018breaking, uehara2020minimax} show that, as long as the data covers the target policy and we are given function classes that can represent \textit{both} the value function and the density-ratio function (or marginalized importance weights) of the target policy, %between the target policy and the data, 
it is possible to estimate the performance of the target policy in a sample-efficient manner. 
One way of using such results for policy learning is to use OPE as a subroutine and optimize a policy using OPE's assessment of the policy's performance. Unfortunately, such a direct application introduces the prohibitive expressivity assumptions that we wanted to avoid in the first place, such as the realizability of value functions for \textit{all} candidate policies \citep{jiang2020minimax}. 

In this paper, we provide sample-efficiency guarantees for offline RL under the desired assumptions, that data is only guaranteed to cover the optimal policy, and the function classes only represent the optimal value function and density ratio, respectively. Our algorithms have a simple procedure that combines marginalized importance sampling (MIS) with pessimism in a novel fashion. The key enabler of our guarantees is an additional \textit{gap} assumption, that there is a nontrivial gap between the values of the (unique) greedy action and the second-best for every state. Similar gap assumptions are common in RL theory to characterize easy problems in which stronger-than-usual guarantees can be obtained. They are often used to achieve logarithmic or constant regret in bandits and tabular online RL  \citep{bubeck2012regret,ok2018exploration,slivkins2019introduction,lattimore2020bandit,he2021logarithmic,papini2021reinforcement}, and similar guarantees in offline RL under Bellman-completeness and additional structural assumptions on the value-function class \citep{hu2021fast}. They are also used in online RL with function approximation to block exponential error amplification \citep{du2019provably}. To our knowledge, our work is the first one to identify the utility of gap assumptions in offline RL with weak function approximation and offer interesting insights into novel aspects and mechanisms of gaps %, such as disentangling the true gap and that of the function approximator 
(see \pref{sec:appx_error}).


\paragraph{Paper Organization} The rest of the paper is organized as follows: \pref{sec:prelim} introduces preliminary concepts and the problem setting. \pref{sec:alg} describes the algorithm. \pref{sec:main} provides the core analysis. %, with the proofs outlined in \pref{sec:proof}. 
We further extend our results to the setting where the function classes are misspecified (\pref{sec:appx_error}), and when the gap parameter is unknown but we have access to a small amount of online interactions (\pref{sec:unknown_gap}). We conclude the paper with further discussions in \pref{sec:discuss}, including a detailed comparison to the concurrent work of \citet{zhan2022offline} on the same problem. 


\section{Preliminaries} \label{sec:prelim}
\paragraph{Markov Decision Processes (MDPs)}
We consider finite-horizon episodic MDPs defined in the form of $\Mcal = (\Xcal, \Acal, P, R, H, x_0)$, where $\Xcal=\Xcal_0\bigcup\ldots\bigcup\Xcal_{H-1}$ is the layered state space with $\Xcal_h$ denoting the state space at timestep $h$, $\Acal$ is the action space, $P=(P_0,\ldots,P_{H-1})$ is the transition function with $P_h: \Xcal_h\times\Acal\to\Delta(\Xcal_{h+1})$, $R=(R_0,\ldots,R_{H-1})$ is the reward function with $R_h: \Xcal_h\times\Acal\to [0, 1]$, $H$ is the length of the horizon, and $x_0$ is the fixed initial state.\footnote{We consider a fixed initial state and a deterministic reward function. They can be easily generalized to the stochastic case.} We assume the state and action spaces are finite but can be arbitrarily large, and $\Delta(\cdot)$ denotes the probability simplex over a finite set. We define a policy $\pi=\{\pi_0,\ldots,\pi_{H-1}\}$, where for each $h\in[H]$, $\pi_h: \Xcal_h \to \Delta(\Acal)$ is the policy at timestep $h$ and we use $[H]$ to denote $\{0,\ldots,H-1\}$. With a slight abuse of notation, when $\pi_h(\cdot)$ is a deterministic policy, we assume $\pi_h(\cdot):\Xcal_h\to \Acal$. Policy $\pi$ induces a distribution over trajectories from the initial state, which we denote as $\Pr\nolimits_\pi(\cdot)$ and can be described as starting with $x_0$ and $a_h \sim \pi_h(\cdot|x_h), r_h = R_h(x_h, a_h), x_{h+1} \sim P_h(\cdot|x_h, a_h), \forall h \in[H]$. As a convention, we will use $x_h,a_h,r_h$ to refer to the state, action, and reward at timestep $h$ (thus $x_h\in\Xcal_h$). The performance of a policy is measured by its expected return, defined as $v^\pi := \EE_\pi[\sum_{h=0}^{H-1} r_h]$, where the expectation is taken with respect to $\Pr\nolimits_\pi(\cdot)$. For any $f_h\in\RR^{\Xcal_h\times\Acal}$, we use $\pi_{f_h}(x_h):=\argmax_{a_h\in\Acal}f_h(x_h,a_h)$ to denote its greedy policy at timestep $h$. 
Among all policies, there always exists a policy, denoted as $\pi^*$, that maximizes the return from all starting states simultaneously. This policy is the greedy policy of the optimal action-value (or Q-) function, $Q^*=(Q_0^*,\ldots,Q_{H-1}^*)$, i.e., $\pi^* = \pi_{Q^*}:=(\pi_{Q^*_0},\ldots,\pi_{Q^*_{H-1}})$. $Q^*$ is the unique solution to the Bellman optimality equations $Q^*_h = \Tcal_h Q^*_{h+1}$, where $\Tcal_h: \RR^{\Xcal_{h+1}\times\Acal} \to \RR^{\Xcal_h\times\Acal}$ is the Bellman optimality operator: $\forall f_{h+1} \in \RR^{\Xcal_{h+1}\times\Acal}$, $(\Tcal_hf_{h+1})(x_h,a_h):=R_h(x_h,a_h)+\EE_{x_{h+1} \sim P_h(\cdot|x_h, a_h)}[ \max_{a_{h+1}}f_{h+1}(x_{h+1}, a_{h+1})]$. We can similarly define policy-specific Q-functions $Q^\pi$ and their state-value function counterparts, namely $V^*$ and $V^\pi$. Another useful concept is the notion of state-action occupancy of a policy $\pi$, $d^\pi_h(x_h',a_h') := \Pr\nolimits_\pi(x_h=x_h',a_h=a_h')$. As a shorthand, we define $d^*_h := d^{\pi^*}_h$ and use $a_{i:j}$ to refer to actions $a_i,\ldots,a_j$.



\paragraph{Offline RL} We consider a standard theoretical setup for offline RL, where we are given a dataset $\Dcal=\Dcal_0\bigcup\ldots\bigcup\Dcal_{H-1}$ with the form $\Dcal_h=\{(x_h^{(i)},a_h^{(i)},r_h^{(i)},$ $x_{h+1}^{(i)})\}_{i=1}^n$, and $\Dcal_h$ consists of $(x_h,a_h,r_h,x_{h+1})$ tuples sampled i.i.d.~from the following generative process: $(x_h,a_h)\sim d^D_h,r_h=R_h(x_h,a_h),x_{h+1}\sim P_h(\cdot| x_h,a_h)$. Note that $r_h$ and $x_{h+1}$ are generated according to the MDP reward and transition functions, and $d^D_h$ fully determines the quality and coverage of the data distribution. For a given policy $\pi$, $w^\pi_h(x_h,a_h):= d^\pi_h(x_h,a_h) / d^D_h(x_h,a_h)$ measures how well $d^D_h$ covers the occupancy induced by $\pi$ at timestep $h$ and is often known as the density-ratio function or the marginalized importance weight. It plays an important role in offline RL algorithms and analyses. As another shorthand, we use notation $w^\pi=(w_0^\pi,\ldots,w_{H-1}^\pi)$ to denote the density-ratio function over all timesteps and notation $w^*:= w^{\pi^*}$ to denote the density ratio of the optimal policy.

\paragraph{Function Approximation} We consider the function approximation setting, where we are given a function class $\Fcal=\Fcal_0\times\ldots\times\Fcal_{H-1}$ with $\Fcal_h\subseteq(\Xcal_h\times\Acal\rightarrow\RR),\forall h\in[H]$ and a weight function class $\Wcal=\Wcal_0\times\ldots\times\Wcal_{H-1}$ with $\Wcal_h \subseteq(\Xcal_h\times\Acal\rightarrow\RR), \forall h\in[H]$. We assume these are finite classes and use $\log(|\Fcal|)$ and $\log(|\Wcal|)$ to measure their statistical capacities. The extension to continuous or infinite classes with a covering argument is standard. By default, for any $f\in\Fcal,$ we assume $f_H=\zero$ for technical simplicity and use $V_f$ to denote its induced state-value function, i.e., $V_f(x_h)=\max_{a_h\in\Acal} f_h(x_h,a_h)$. We will also use $\pi_f(x_h)$ instead of $\pi_{f_h}(x_h)$ for simplicity since only $f_h$ operates on $x_h\in \Xcal_h$ and there is no confusion.


\section{Algorithm} \label{sec:alg}

In this section, we introduce our algorithm PABC (Pessimism under Average Bellman error Constraints), whose pseudo-code is given in \pref{alg:pess_alg}. The algorithm takes two steps: a prescreening step (\pref{line:prescreen}), followed by the main step (\pref{line:pess_select}). We first give an intuition for the main step, deferring the explanation of the prescreening step and the related gap definitions to the later part of this section. 


\begin{algorithm}[htb]
	\caption{PABC (Pessimism under Average Bellman error Constraints)\label{alg:pess_alg}}
	\begin{algorithmic}[1]
	    \REQUIRE threshold $\alpha>0$, gap parameter $\cgap$, function class $\Fcal$, weight function class $\Wcal$, and dataset $\Dcal$.
	    \STATE Perform prescreening according to input  $\cgap$: \label{line:prescreen}
	    \begin{align} %\label{line:prescreen}
	    \Fcal(\cgap):=\{f \in \Fcal: \gap(f)\ge\cgap\}.
	    \end{align}%\vspace{-1.5em}
		\STATE Find the pessimistic value function in $\Fcal(\cgap)$ subject to average Bellman error constraints \label{line:pess_select}
		\begin{align}
		\label{eq:constraint}
		&\hat f=\argmin_{f\in\Fcal(\cgap)} f_0(x_0,\pi_{f}(x_0))
		\notag\\
    	\text{s.t.}& \max_{w\in\Wcal,h\in[H]} |\Lcal_{\Dcal}(f,w,h)|\le\alpha,
    	\end{align}
    	where the empirical loss $\Lcal_{\Dcal}(f,w,h)$ is defined as
    	\begin{align}
        \Lcal_{\Dcal}(f,w,h) &%&~=\EE_{\Dcal}[w_h(x_h,a_h)(f_h(x_h,a_h)-r_h-f_{h+1}(x_{h+1},\pi_f(x_{h+1})))].\\
        =\frac{1}{n}\sum_{i=1}^n[w_h(x_h^{(i)},a_h^{(i)})(f_h(x_h^{(i)},a_h^{(i)}) \nonumber\\
        &-r_h^{(i)}-f_{h+1}(x_{h+1}^{(i)},\pi_f(x_{h+1}^{(i)})))]. \label{eq:LD}
        \end{align}
    	%\begin{align}
        %\Lcal_{\Dcal}(f,w,h) &
        %=\frac{1}{n}\sum_{i=1}^n[w_h(x_h^{(i)},a_h^{(i)})(f_h(x_h^{(i)},a_h^{(i)})-r_h^{(i)}-f_{h+1}(x_{h+1}^{(i)},\pi_f(x_{h+1}^{(i)})))]. \label{eq:LD}
        %\end{align}
		\ENSURE policy $\pi_{\hat f}$ and return estimate $\hat f_0(x_0, \pi_{\hat f}(x_0))$.
	\end{algorithmic}
\end{algorithm}


The main step (\pref{line:pess_select}) runs a constrained optimization to select a function $\hat f \in \Fcal(\cgap)$, whose greedy policy is the output. The objective of the optimization minimizes the value at the initial state, which is a form of initial-state pessimism \citep{xie2021bellman,zanette2021provable} that has proved useful in handling insufficient data coverage. The constraints eliminate functions with large \textit{average Bellman errors} \citep{jiang2017contextual, xie2020q}. 

\paragraph{Average Bellman Error Constraints} To provide intuition, recall that $Q^*$ has zero Bellman error at every state-action pair, that is, $\forall h\in[H], x_h\in\Xcal_h, a_h\in\Acal$, $(Q^*_h - \Tcal_h Q^*_{h+1})(x_h, a_h) = 0.$
Thus it also has zero average Bellman error under any distribution $\nu_h$ at timestep $h$:
$$
\EE_{(x_h,a_h) \sim \nu_h}[(Q^*_{h} - \Tcal_h Q^*_{h+1})(x_h,a_h)] = 0.
$$
This holds even if $\nu_h$ is an unnormalized distribution. Therefore, we can safely eliminate any candidate function $f\in\Fcal$, if it has a large average Bellman error $\EE_{\nu_h}[f_h - \Tcal_h f_{h+1}]$ under any (possibly unnormalized) distribution $\nu_h$. Unlike the more standard versions of Bellman errors such as $\EE_{\nu_h}[(f_h - \Tcal_h f_{h+1})^2]$, which squares the Bellman error in each state before taking expectation and cannot be directly estimated due to the infamous \textit{double-sampling} difficulty \citep{baird1995residual, farahmand2011model}, the average Bellman error can be easily estimated. %, since we do not take square or absolute value in each state and directly take a plain expectation. 
In the algorithm, we consider a variety of  (possibly unnormalized) distributions $\nu_h = w_h \cdot d^D_h$ for $w\in\Wcal,h\in[H]$. Since the average Bellman error can only be empirically approximated (see \Cref{eq:LD}), we relax the constraints and allow a threshold $\alpha$ to incorporate the statistical errors. 
We note that the constraint alone is similar to the MABO algorithm by \citet{xie2020q}, but they do not use pessimism and cannot handle insufficient data coverage. They also assume that $\Wcal$ is sufficiently rich to approximate $w^{\pi_f}$ \textit{for all} $f\in\Fcal$, and a main goal of our work is to avoid such ``for all'' assumptions. 
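
To make the link between \Cref{eq:LD} and the average Bellman error explicit, the following short calculation may help. It is only a sketch, and it uses nothing beyond the data-generating process of \pref{sec:prelim} and the fact that $\pi_f$ is greedy with respect to $f$, so that $f_{h+1}(x_{h+1},\pi_f(x_{h+1}))=\max_{a_{h+1}}f_{h+1}(x_{h+1},a_{h+1})$. For any fixed $f\in\Fcal$, $w\in\Wcal$, and $h\in[H]$, taking the expectation over the i.i.d.~draws in $\Dcal_h$ gives
\begin{align*}
&\EE[\Lcal_{\Dcal}(f,w,h)] \\
&= \EE_{(x_h,a_h)\sim d^D_h}\Big[w_h(x_h,a_h)\big(f_h(x_h,a_h)-R_h(x_h,a_h) \\
&\qquad -\EE_{x_{h+1}\sim P_h(\cdot|x_h,a_h)}[\max_{a_{h+1}}f_{h+1}(x_{h+1},a_{h+1})]\big)\Big] \\
&= \EE_{(x_h,a_h)\sim d^D_h}\big[w_h(x_h,a_h)\,(f_h-\Tcal_h f_{h+1})(x_h,a_h)\big].
\end{align*}
The last line is the average Bellman error of $f$ under the (possibly unnormalized) distribution $\nu_h=w_h\cdot d^D_h$. Because the Bellman error enters linearly rather than squared, a single sampled next state per $(x_h,a_h)$ yields an unbiased estimate, which is why no double sampling is needed.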


\paragraph{Gap and Prescreening} As mentioned in the introduction, a key assumption that enables our results is a gap assumption on value functions. To prepare for the discussion, we define the gap of a function as follows:
\begin{definition}[Gap]
\label{def:gap}
For any $f = (f_0, \ldots, f_{H-1})$, we define its gap at timestep $h \in [H]$ and state $x_h\in\Xcal_h$  as follows: If $\argmax_{a_h \in\Acal} f_h(x_h, a_h)$ is unique, % $a_1,a_2\in\Acal, a_1\neq a_2$ s.t. $f_h(x,a_1)=f_h(x,a_2)=\max_{a\in\Acal}f_h(x,a)$), 
then we define $\gap(f;h,x_h):=\min_{a_h\neq \pi_{f}(x_h)}\left[f_h(x_h,\pi_f(x_h))-f_h(x_h,a_h)\right]$.  Otherwise, we define $\gap(f;h,x_h):=0$.

The gap of $f$ is then defined as
$$\gap(f):=\min_{h\in[H],x_h\in\Xcal_h}\gap(f;h,x_h).$$
\end{definition}
As we see, this definition of the gap is similar to the one used in prior works \citep{simchowitz2019non,mou2020sample,du2019provably,he2021logarithmic,yang2021q,hu2021fast,wang2021exponential,papini2021reinforcement,wu2021gap}, except that we require a unique optimal action for the gap to be non-zero. A motivating example of similar gap assumptions in other areas of RL theory can be found in \citet{wu2021gap}. 
With \pref{def:gap}, we can now define the minimum gap of a function class:
\begin{definition}[Gap of a function class]
\label{def:gapmin_val_class}
Given a function class $\Gcal=\Gcal_0\times\ldots\times\Gcal_{H-1}$, where $\Gcal_h \subseteq(\Xcal_h\times\Acal\rightarrow\RR), \forall h\in[H]$, we define its gap as
$$\gap(\Gcal):=\min_{g\in\Gcal} \gap(g).$$
\end{definition}
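
As a small illustration of \pref{def:gap} and \pref{def:gapmin_val_class} (the numbers are hypothetical and chosen only for concreteness), consider a single timestep and a single state $x$ with three actions $a_1,a_2,a_3$, and two functions whose values at $x$, listed in the order of the actions, are
\[
f(x,\cdot)=(0.9,\;0.6,\;0.2),\qquad g(x,\cdot)=(0.9,\;0.9,\;0.2).
\]
The greedy action of $f$ is unique, so $\gap(f)=0.9-0.6=0.3$. The greedy action of $g$ is not unique, so $\gap(g)=0$. For the class $\Gcal=\{f,g\}$ we therefore get $\gap(\Gcal)=\min\{0.3,\,0\}=0$, and keeping only the functions with gap at least $0.25$ leaves $\{f\}$, whose gap is $0.3\ge 0.25$.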

Prior theoretical results relying on similar gap assumptions often make such assumptions on the true optimal value function $Q^*$ \citep{simchowitz2019non, yang2021q}. As we will see in our analyses, however, what is really important for us is that the \textit{learned} function $\hat f$ has a large gap, not the true $Q^*$. Since we have no control over which $f$ in the function class will be finally chosen, we perform the prescreening step in \pref{line:prescreen} to eliminate functions whose gap is lower than a pre-defined threshold $\cgap \ge 0$. It is immediate to see that $\gap(\Fcal(\cgap))\ge\cgap$. Of course, this runs the risk of eliminating $Q^*$, and if we do not want any misspecification, we need to ensure $Q^* \in \Fcal(\cgap)$, which requires that $\cgap \le \gap(Q^*)$. For clarity, in \pref{sec:find_near_optimal} we will assume that we know $\gap(Q^*)$ and can set $\cgap$ accordingly, while later in \pref{sec:unknown_gap} we show how to handle unknown $\gap(Q^*)$. Moreover, as we will see in \pref{sec:appx_error}, when we allow misspecification errors in the analysis, $\gap(Q^*)$ and $\cgap$ become disentangled, which leads to some interesting implications.


\section{Main Guarantees}  \label{sec:main}
In this section, we present the main sample complexity results of our algorithms. We start with a weak version of the guarantee by showing that our algorithm can identify $v^*$, the optimal expected return at the initial state, with polynomially many samples under realizability and single-policy coverage assumptions, even without any gap assumption (\pref{sec:find_v_star}). Such a result will also be useful when we handle the unknown gap setting later in \pref{sec:unknown_gap}. Then, \pref{sec:counter} provides an algorithm-specific counterexample to show that our algorithm fails to find a near-optimal policy under these assumptions, motivating the necessity of the gap assumption. Finally, \pref{sec:find_near_optimal} provides the main result of this paper under the additional gap assumption.

\subsection{Estimating the Optimal Expected Return}
\label{sec:find_v_star}
We first show how to identify $v^*$, the optimal expected return of the problem, \textit{without} needing the gap assumption. In this case, we will run \pref{alg:pess_alg} with $\gapmin=0$, that is, without the prescreening step. To our knowledge, there is no prior work that can achieve this goal under our weak assumptions.\footnote{We note that under \citet{zhan2022offline}'s assumptions, their algorithm, with regularization removed, can also identify $v^*$.} Despite not producing a near-optimal policy, this procedure and its guarantee allow us to check whether any given policy is close to optimal, assuming we can evaluate the policy's performance by off-policy evaluation or a small amount of online interactions. This capability can be very useful, especially in certain model selection scenarios \citep[see e.g.,][Section 5]{modi2020sample}. Indeed, we will reuse this result later in \pref{sec:unknown_gap} to handle the unknown gap setting.

We start by introducing the assumptions. The first two are the standard realizability assumptions.

\begin{assum}[Realizability of $\Fcal$]
\label{assum:realizablity_q}
We assume $Q^*=(Q^*_0,\ldots,Q^*_{H-1})\in\Fcal$.
\end{assum}

\begin{assum}[Realizability of $\Wcal$]
	\label{assum:realizablity_w}
    We assume $w^*=(w^*_0,\ldots,w^*_{H-1})\in\Wcal$.
\end{assum}
%\begin{remark}
We make these assumptions exact for now to allow for a clean presentation of the main results and core proof ideas, and defer the handling of misspecification errors to \pref{sec:appx_error}. Also, following the arguments in \citet{uehara2020minimax, xie2020q}, \pref{assum:realizablity_w} can be further relaxed such that $w^*$ only needs to lie in the convex hull of $\Wcal$.  

Next, we introduce the standard boundedness assumptions. 

\begin{assum}[Boundedness of $\Fcal$]
	\label{assum:bound_q}
For any $f\in\Fcal$, we assume $f_h\in (\Xcal_h\times\Acal\rightarrow[0,H-h]),\forall h\in[H]$.
\end{assum}

\begin{assum}[Boundedness of $\Wcal$]
	\label{assum:bound_w}
	For any $w\in\Wcal$, we assume $\|w_h\|_\infty \le C, \forall h\in[H]$. 
\end{assum}


\pref{assum:realizablity_w} and \pref{assum:bound_w} together immediately imply that our data covers $\pi^*$: 
\[\frac{d^*_h(x_h,a_h)}{d^D_h(x_h,a_h)}\le C,\forall h\in[H], x_h\in\Xcal_h,a_h\in\Acal.
\]
This version of coverage is often called $\pi^*$-concentrability \citep{scherrer2014approximate, xie2021policy, rashidinejad2021bridging, zhan2022offline}. As we will see when we consider misspecification errors in \pref{sec:appx_error}, we do not really need our data to satisfy $\pi^*$-concentrability, and the definition of coverage can be relaxed using the structure and generalization effects of $\Fcal$ similarly to \citet{jin2021pessimism, xie2021policy}. 
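
For completeness, the coverage bound displayed above follows in one line from the definitions; we spell it out as a sketch. By \pref{assum:realizablity_w} we have $w^*\in\Wcal$, so \pref{assum:bound_w} applies to $w^*$ itself. Hence, for every $h\in[H]$ and every $(x_h,a_h)$ with $d^D_h(x_h,a_h)>0$,
\[
\frac{d^*_h(x_h,a_h)}{d^D_h(x_h,a_h)} = w^*_h(x_h,a_h) \le \|w^*_h\|_\infty \le C.
\]
Equivalently, $d^D_h(x_h,a_h)\ge d^*_h(x_h,a_h)/C$ on such pairs, so the data distribution places at least a $1/C$ fraction of the optimal policy's occupancy on each of them.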



With all the above assumptions, we are ready to state our first result formally below. The proof is deferred to \pref{app:proof_find_v_star}. 

\begin{theorem}[Sample complexity of estimating $v^*$]
\label{thm:find_v_star}
Suppose Assumptions \ref{assum:realizablity_q}, \ref{assum:realizablity_w}, \ref{assum:bound_q}, \ref{assum:bound_w} hold and the total number of samples $nH$ satisfies 
$$
nH\ge \frac{8C^2H^5\log(2|\Fcal||\Wcal|H/\delta)}{\varepsilon^2}.
$$
Then with probability at least $1-\delta$, running \pref{alg:pess_alg} with $\gapmin=0$ and $\alpha=\varepsilon/(2H)$ guarantees \[|V_{\hat f}(x_0)-v^*|\le \varepsilon.\]
\end{theorem}
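
The following back-of-the-envelope calculation is \emph{not} the proof (which is deferred to the appendix) and is only a sketch, under the assumption that a standard Hoeffding-plus-union-bound argument is used, of how the sample size and the threshold $\alpha$ in \pref{thm:find_v_star} fit together. By \pref{assum:bound_q}, \pref{assum:bound_w}, and $r_h\in[0,1]$, each summand of $\Lcal_{\Dcal}(f,w,h)$ in \Cref{eq:LD} lies in $[-CH, CH]$. Hoeffding's inequality then gives, for any fixed $(f,w,h)$ and any $t>0$,
\[
\Pr\big(|\Lcal_{\Dcal}(f,w,h)-\EE[\Lcal_{\Dcal}(f,w,h)]|\ge t\big)\le 2\exp\Big(-\frac{n t^2}{2C^2H^2}\Big).
\]
A union bound over the $|\Fcal||\Wcal|H$ choices of $(f,w,h)$ shows that, with probability at least $1-\delta$, all deviations are simultaneously bounded by
\[
t = CH\sqrt{\frac{2\log(2|\Fcal||\Wcal|H/\delta)}{n}}.
\]
Requiring $t\le \varepsilon/(2H)=\alpha$ and solving for $n$ yields
\[
n\ge \frac{8C^2H^4\log(2|\Fcal||\Wcal|H/\delta)}{\varepsilon^2},
\]
which, multiplied by $H$, is the condition on $nH$ stated in the theorem. Under this reading, $\alpha$ is chosen so that $Q^*$, whose population average Bellman error is zero, satisfies all empirical constraints with high probability.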

\subsection{Algorithm-Specific Counterexample} \label{sec:counter}
Despite being able to identify $v^*$, we show that \pref{alg:pess_alg} cannot be guaranteed to learn a near-optimal policy without further assumptions, even with infinite data. As we will see, a key aspect of the construction is a tie between the values of different actions, so such counterexamples can be effectively excluded by assuming a unique optimal action. 

The counterexample is given in \pref{fig:counter}. 
Circles denote states and arrows denote actions with deterministic transitions, and states without arrows have a default $\mathrm{null}$ action. The only nonzero rewards are $+1$ at states $x_C$ and $x_D$; the rewards are $0$ everywhere else. Taking $\textrm{L}_1$ at $x_0$ deterministically transitions to $x_A$, and we omit the remaining specifications as they are clearly indicated in the figure.

\begin{figure}
	\centering
	\includegraphics[scale=1]{Figure/counter_exp.pdf}
	\caption{Algorithm-specific counterexample without the gap assumption.}
	\label{fig:counter}
\end{figure}



It is easy to see that the optimal policy $\pi^*$ takes action $\textrm{L}_1$ at state $x_0$. By adversarial tie breaking, we assume $\pi^*$ takes action $\textrm{L}_2$ at state $x_A$. We construct a bad function $f$, which only differs from $Q^*$ at $(x_0,\textrm{R}_1)$ and $(x_B,\mathrm{null})$ by setting $f_0(x_0,\textrm{R}_1)=1$ and $f_1(x_B,\mathrm{null})=1$. By adversarial tie breaking, we assume $\pi_f(x_0)=\textrm{R}_1$ and $\pi_f(x_A)=\textrm{L}_2$. We immediately have a realizable class $\Fcal=\{Q^*,f\}$. It is easy to verify that $\pi_f$ is not $\varepsilon$-optimal for any $\varepsilon<1$ because it deviates from the optimal branch at $x_0$. In addition, we let the data distribution $d^D$ cover $(x_0,\textrm{L}_1)$, $(x_A,\textrm{L}_2)$, $(x_C,\mathrm{null})$. For the weight function, we define an \textit{invalid} weight function $w_{\textrm{bad}}$ that puts all weight on $(x_0,\textrm{R}_1),(x_A,\textrm{L}_2),(x_C,\mathrm{null})$ in each level respectively. Then we also have a realizable class $\Wcal=\{w^*,w_{\textrm{bad}}\}$. 

As $\gap(Q^*)=0$ in this counterexample, no function will be ruled out in the prescreening step (\pref{line:prescreen}). Since both $f$ and $Q^*$ have zero population average Bellman error under $\Wcal$, and $f_0(x_0, \pi_f(x_0)) = Q^*_0(x_0, \pi_{Q^*}(x_0)) = 1$, either of them can be the $\hat f$ learned in \pref{alg:pess_alg}, but returning $\pi_f$ leads to failure of learning. %This implies that both of them satisfy all constraints in \pref{alg:pess_alg}, and they have equal probabilities to be returned. Therefore $f$ (or equivalently $\pi_f$) will be returned with 0.5 probability, and our algorithm fails. 
We note that the reason for failure is that no data covers the state $\pi_f$ visits.

\paragraph{Additional Consistency Constraints} In our setting, there are additional constraints that one can add to ensure some form of consistency. 
For example, for any $f\in\Fcal$, we can additionally require that there exists $w\in\Wcal$ that is consistent with $\pi_{f}$, since $w^* \in \Wcal$ should only give non-zero weight to actions chosen by $\pi_f$ (i.e., $\forall h\in[H], w_h(x_h,a_h)=0 \text{ if } a_h\neq \pi_f(x_h)$). In addition, as we can estimate $v^*$ with the assumptions in \pref{thm:find_v_star} and we know $\EE_{d^D}[w^* \cdot R] = v^*$, we can eliminate any $w \in \Wcal$ that violates this condition. While these constraints are reasonable (or at least harmless) and may be of independent interest, we can verify that they do not help with this counterexample, which implies our algorithm fails even under these additional consistency checks.
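
To see why the second condition holds for $w^*$, here is a one-line check. It is only a sketch, under the assumptions that $w^*_h = d^*_h/d^D_h$ is well defined, that $d^*_h$ denotes the level-$h$ occupancy of $\pi^*$, and that $\EE_{d^D}[w^*\cdot R]$ is read as a sum over levels:
\begin{align*}
\sum_{h=0}^{H-1}\EE_{d^D_h}[w^*_h\cdot R]
=\sum_{h=0}^{H-1}\EE_{d^*_h}[R]
=v^*.
\end{align*}
The first equality is the change of measure $d^D_h\cdot w^*_h=d^*_h$, and the second is the definition of the expected return of $\pi^*$.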




\subsection{Learning a Near-Optimal Policy}
\label{sec:find_near_optimal}
As mentioned above, a key aspect of the counterexample is a tie between the values of actions. In this section, we show that a positive gap assumption not only excludes the counterexample, but enables a general guarantee for learning near-optimal policies with our \pref{alg:pess_alg}. 
\begin{assum}[Gap of $Q^*$]
\label{assum:gap_plus}
The gap of $Q^*$ satisfies
\[\gap(Q^*)>0.\]
%\[\gap(Q^*)\ge \gapmin.\]
\end{assum}
Here the implicit assumption is that we want $\gap(Q^*)$ to be sufficiently large, as our later sample complexity guarantees will scale inversely with $\gap(Q^*)$. Note that
\pref{assum:gap_plus} is stronger than the standard gap assumption in the literature \citep{simchowitz2019non, yang2021q, hu2021fast}. Compared with their definition, we additionally assume the optimal action is unique at each state. On the other hand, these papers require additional strong assumptions (e.g., linear MDPs, Bellman-completeness, or pointwise convergence) or focus on the tabular setting, whereas we handle general function approximation in offline RL under weak realizability-type assumptions. Plus, the technical mechanisms under which the gap plays a role in the analyses are very different, so the assumptions are not really comparable.

For now, we assume $\gapq$ is known; we handle the case of an unknown gap in \pref{sec:unknown_gap}. Moreover, as a side effect of handling misspecification errors, \pref{sec:appx_error} will lift the stringent gap assumption in a novel and interesting manner. 

We now state the guarantee of learning a near-optimal policy under the gap assumption. A proof sketch is provided after the theorem statement, while the complete proof is deferred to \pref{app:proof_main}.

\begin{theorem}[Sample complexity of learning a near-optimal policy]
\label{thm:main}
	Suppose Assumptions \ref{assum:realizablity_q}, \ref{assum:realizablity_w}, \ref{assum:bound_q}, \ref{assum:bound_w}, \ref{assum:gap_plus} hold and the total number of samples $nH$ satisfies 
	\[nH\ge \frac{8C^2H^7\log(2|\Fcal||\Wcal|H/\delta)}{\varepsilon^2 \gapq^2}.
	\]
	Then with probability at least $1-\delta$, running \pref{alg:pess_alg} with $\alpha=\varepsilon\gapq/(2H^2)$ and $\gapmin=\gapq$ guarantees \[v^{\pi_{\hat f}} \ge v^*-\varepsilon.\]
\end{theorem}





\paragraph{Proof sketch of \pref{thm:main}} 
As is standard, all our results rely on a concentration event, namely that
%\begin{align*}
$\abr{\Lcal_{\Dcal}(f,w,h)-\EE[\Lcal_{\Dcal}(f,w,h)]}\le \estat$ holds for all $f\in\Fcal, w\in\Wcal, h\in[H]$ with high probability; 
%\end{align*}
%\end{lemma}
the detailed expression of $\estat$ is given in \pref{lem:conc}. From \pref{lem:conc}, for any $f\in\Fcal$ that satisfies all constraints in \pref{alg:pess_alg}, we can guarantee the population loss to be small, that is, 
%\[
$\abr{\EE\sbr{\Lcal_{\Dcal}(f,w,h)}}\le \estat+\alpha.$ 
%\]


The central step of our proof is to use a telescoping argument and the gap assumption to establish the following inequality:
\begin{align*}
&V_0^*(x_0)\ge V_{\hat f}(x_0)\ge V_0^*(x_0)-H(\estat+\alpha)
\\
&\quad+\gapq \EE\left[\sum_{h=0}^{H-1}\one\{\pi_{\hat f}(x_h)\neq \pi^*(x_h)\}\mid \pi^*\right].
\end{align*}
%\begin{align*}
%&V_0^*(x_0)\ge V_{\hat f}(x_0)\ge V_0^*(x_0)-H(\estat+\alpha)+\gapq \EE\left[\sum_{h=0}^{H-1}\one\{\pi_{\hat f}(x_h)\neq \pi^*(x_h)\}\mid \pi^*\right].
%\end{align*}
This implies the policy deviation can be bounded as
\[
\EE\left[\sum_{h=0}^{H-1}\one\{\pi_{\hat f}(x_h)\neq \pi^*(x_h)\}\mid\pi^*\right]\le \frac{H(\estat+\alpha)}{\gapq}.
\]
The LHS of this inequality is the expected number of time steps at which the learned policy $\pi_{\hat f}$ disagrees with the optimal policy $\pi^*$ under the distribution induced by $\pi^*$. From here, we can apply the RL-to-supervised-learning (SL) reduction in imitation learning (e.g., Theorem 2.1 in \citet{ross2010efficient}) to translate it into the final performance difference bound between $\pi_{\hat f}$ and $\pi^*$. We also provide a different proof in \pref{app:proof_main}, which may be of independent interest.
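
To make the last step concrete, the following is a rough sketch of how the deviation bound turns into the $\varepsilon$ of \pref{thm:main}. It assumes that rewards lie in $[0,1]$, so that a trajectory on which $\pi_{\hat f}$ deviates from $\pi^*$ loses at most $H$ in return, and that the sample-size condition of \pref{thm:main} is what makes $\estat\le\varepsilon\gapq/(2H^2)$; the exact constants are those of \pref{lem:conc} and \pref{app:proof_main}, not of this sketch.
\begin{align*}
v^*-v^{\pi_{\hat f}}
&\le H\,\EE\left[\sum_{h=0}^{H-1}\one\{\pi_{\hat f}(x_h)\neq \pi^*(x_h)\}\mid\pi^*\right]\\
&\le \frac{H^2(\estat+\alpha)}{\gapq}
\le \frac{H^2}{\gapq}\cdot\frac{\varepsilon\gapq}{H^2}
=\varepsilon,
\end{align*}
where the last inequality uses $\alpha=\varepsilon\gapq/(2H^2)$ together with the assumed bound on $\estat$.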


\section{Robustness to Misspecification} 
\label{sec:appx_error}
We now consider the case when $Q^*$ and $w^*$ may not exactly belong to $\Fcal$ and $\Wcal$, but can be reasonably approximated up to small errors. More often than not, such robustness results in RL theory are nothing but routine exercises where the proofs are largely straightforward extensions of those for the exact case. In our case, however, the misspecification analyses reveal an interesting phenomenon of disentangling the true gap of $Q^*$ and that of $\Fcal$, and how our gap and coverage assumptions can be relaxed in nontrivial ways. 

We start with defining the approximation errors of our function classes. Inspired by \citet{xie2020q}, we define the approximation error of $\Wcal$ as 
\begin{align}
\label{eq:appx_w}
%\varepsilon_{\Wcal}=~\min_{w\in\Wcal}\max_{f\in\Fcal}\max_{h\in[H]}&|\EE_{d^D_h}[w_h\cdot (f_h-\Tcal_h f_{h+1})] -\EE_{d^*_h}[f_h-\Tcal_h f_{h+1}]|
\varepsilon_{\Wcal}=~\min_{w\in\Wcal}\max_{f\in\Fcal}\max_{h\in[H]}&|\EE_{d^D_h}[w_h\cdot (f_h-\Tcal_h f_{h+1})]\notag
\\
&-\EE_{d^*_h}[f_h-\Tcal_h f_{h+1}]|
\end{align}
and use $\tilde w^*$ to denote the best approximator in $\Wcal$ that obtains the minimum. 
The expression inside the min-max-max measures the difference between $d_h^D \cdot w_h$ and $d_h^*$, using $f_h - \Tcal_h f_{h+1}$ for $f\in\Fcal$ as discriminators. When $w^* \in \Wcal$ (\pref{assum:realizablity_w}), $d_h^D \cdot w_h^* = d_h^*$ (because $w_h^*$ is defined as $d_h^*/d_h^D$), so we have $\varepsilon_{\Wcal} = 0$. However, the opposite direction is \textit{not} always true: 
%A key property of this definition is that $w^*\in\Wcal \Rightarrow \varepsilon_{\Wcal} = 0$, but the opposite is \textit{not} true. 
it is entirely possible to achieve $\varepsilon_{\Wcal} = 0$ when $w^* \notin \Wcal$ (or even when $d_h^*(x_h,a_h)/d_h^D(x_h,a_h) = \infty$ for some $(x_h,a_h)$ and $w^*$ does not exist), as long as $d_h^D \cdot \tilde w^*_h$ and $d^*_h$ cannot be distinguished by $f_h - \Tcal_h f_{h+1}$ for $f\in\Fcal$ as \emph{discriminators}.\footnote{The idea of using discriminators has also been explored in \citet{farahmand2017value,sun2019model,modi2020sample,modi2021model}, but the application is different here.} We also provide an example in \pref{app:conc_example}. Note that since our data coverage assumption is implicitly made through the realizability and boundedness of $\Wcal$ (see the discussion below \pref{assum:bound_w}), this means that our data coverage assumption is also relaxed using the information of $\Fcal$, which is a common characteristic of recent results in offline RL (e.g., \citet{xie2021bellman} also use the Bellman error class induced by the value function class as discriminators, which is similar to our definition at a high level), but not enjoyed by the concurrent work of \citet{zhan2022offline}. 



For function class $\Fcal$, we define the approximation error in a way that uses $\Wcal$ as discriminators, plus a term that measures the difference under the initial state $x_0$:
\begin{align}
\label{eq:appx_f}
\varepsilon_{\Fcal}=&\min_{f\in\Fcal}\max_{w\in\Wcal}\max_{h\in[H]}(|\EE_{d^D_h}[w_h\cdot (f_h -\Tcal_h f_{h+1})]|
\notag \\
&+\abr{f_0(x_0,\pi_f(x_0))-Q^*_0(x_0,\pi^*(x_0))})
%\varepsilon_{\Fcal}=&\min_{f\in\Fcal}\max_{w\in\Wcal}\max_{h\in[H]}(|\EE_{d^D_h}[w_h\cdot (f_h -\Tcal_h f_{h+1})]|+\abr{f_0(x_0,\pi_f(x_0))-Q^*_0(x_0,\pi^*(x_0))})
\end{align}
%\[\tilde Q^*_\Wcal =\argmin_{f\in\Fcal}\max_{w\in\Wcal}\abr{\EE_{d^D}[w\cdot (f -\Tcal f)]}.\]
and use $\tilde Q^*_\Fcal$ to denote the best approximator that achieves the minimum value.
Under mild regularity assumptions on $\Wcal$,\footnote{Namely, $w \ge 0$ and $\EE_{d^D}[w] = 1$. The former is trivial and the latter can be easily verified approximately on data.} it is straightforward to show that $\varepsilon_{\Fcal}$ is weaker than $\ell_\infty$ error up to multiplicative constants:
\begin{align}\label{eq:compare_epsF}
\varepsilon_{\Fcal}\le 3\min_{f\in\Fcal}\max_{h\in[H]}\|f_h-Q_h^*\|_{\infty}, 
\end{align}
and a more detailed discussion can be found in \pref{lem:appx_f_vs_infty}. 
Similarly, we can define the function class $\Fcal(\gapmin)$ related approximation error $\varepsilon_{\Fcal(\gapmin)}$ and its best approximator $\tilde Q^*_{\Fcal(\gapmin)}$ by replacing $\Fcal$ with $\Fcal(\gapmin)$ in \Cref{eq:appx_f}.
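
For intuition, \Cref{eq:compare_epsF} can be checked term by term. The sketch below assumes that $\Tcal_h$ is the Bellman optimality operator (so that $Q^*_h=\Tcal_h Q^*_{h+1}$ and $\Tcal_h$ is non-expansive in $\ell_\infty$), that $\pi_f$ and $\pi^*$ are greedy with respect to $f$ and $Q^*$, and that the regularity conditions hold per level, i.e., $w_h\ge 0$ and $\EE_{d^D_h}[w_h]=1$; the precise statement is in \pref{lem:appx_f_vs_infty}. For any $f\in\Fcal$, $w\in\Wcal$, and $h\in[H]$,
\begin{align*}
&\abr{\EE_{d^D_h}[w_h\cdot (f_h-\Tcal_h f_{h+1})]}\\
&\quad\le \EE_{d^D_h}\sbr{w_h\cdot \abr{f_h-Q^*_h}}\\
&\qquad+\EE_{d^D_h}\sbr{w_h\cdot \abr{\Tcal_h Q^*_{h+1}-\Tcal_h f_{h+1}}}\\
&\quad\le \|f_h-Q^*_h\|_{\infty}+\|f_{h+1}-Q^*_{h+1}\|_{\infty},
\end{align*}
and, since both policies are greedy,
\begin{align*}
&\abr{f_0(x_0,\pi_f(x_0))-Q^*_0(x_0,\pi^*(x_0))}\\
&\quad=\abr{\max_a f_0(x_0,a)-\max_a Q^*_0(x_0,a)}
\le \|f_0-Q^*_0\|_{\infty}.
\end{align*}
Summing the two bounds gives at most $3\max_{h\in[H]}\|f_h-Q^*_h\|_{\infty}$, and minimizing over $f\in\Fcal$ recovers the factor $3$.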

\subsection{Estimating the Optimal Expected Return}
With the approximation error defined above, we now extend \pref{thm:find_v_star} to the approximate case. Assuming the approximation error $\varepsilon_\Fcal$ (or a reasonably tight upper bound of it) is known, we can relax the constraint to ensure that $\tilde Q^*_{\Fcal(\gapmin)}$ is not eliminated and obtain the sample complexity guarantee as in \pref{thm:find_v_star_appx}. The full proof is deferred to \pref{app:proof_find_v_star_approx}. As before, we do not need the gap assumption to identify $v^*$ approximately and can run the algorithm with $\gapmin = 0$. 

\begin{theorem}[Robust version of \pref{thm:find_v_star}]
\label{thm:find_v_star_appx}
Suppose Assumptions \ref{assum:bound_q}, \ref{assum:bound_w} hold and the total number of samples $nH$ satisfies 
\[
nH\ge \frac{8C^2H^5\log(2|\Fcal||\Wcal|H/\delta)}{\varepsilon ^2}.
\]
Then with probability at least $1-\delta$, running \pref{alg:pess_alg} with $\alpha=\varepsilon /(2H)+\varepsilon_{\Fcal}$ and $\gapmin=0$ guarantees \[|V_{\hat f}(x_0)-v^*| \le  \varepsilon +H\varepsilon_{\Fcal}+H\varepsilon_\Wcal.\]
\end{theorem}


While we need the knowledge of $\varepsilon_\Fcal$ to set $\alpha$, we do not need to know $\varepsilon_\Wcal$, which shows a difference between the behaviors of $\Fcal$ and $\Wcal$. This is also the case in the next subsection where we try to learn a near-optimal policy. 



\subsection{Learning a Near-Optimal Policy}
Similarly, we can also extend \pref{thm:main} to the misspecified case. Our guarantee is established with a user-specified $\gapmin$ parameter and the approximation error related to prescreened class $\Fcal(\gapmin)$. We provide the sample complexity guarantee in \pref{thm:main_appx} and the complete proof in \pref{app:proof_main_approx}.
\begin{theorem}[Robust version of \pref{thm:main}]
\label{thm:main_appx}
Suppose Assumptions \ref{assum:bound_q}, \ref{assum:bound_w} hold and the total number of samples $nH$ satisfies 
\[nH\ge \frac{8C^2H^7\log(2|\Fcal||\Wcal|H/\delta)}{\varepsilon^2 \gapmin^2}.\]
Then with probability at least $1-\delta$, running \pref{alg:pess_alg} with a user-specified $\gapmin$ and $\alpha=\varepsilon \gapmin/(2H^2)+\varepsilon_{\Fcal(\gapmin)}$ guarantees \[v^{\pi_{\hat f}} \ge v^*-\varepsilon  - \frac{(H^2+H)\varepsilon_{\Fcal(\gapmin)}+H^2\varepsilon_\Wcal}{\gapmin}.\]
\end{theorem}

\pref{thm:main_appx} gives us a convenient way to set the gap parameter $\gapmin$. We also provide a sample complexity guarantee (\pref{corr:main_appx}) in \pref{sec:corr_main_appx} for the case that $\gapq$ and the $\ell_\infty$ approximation error of $\Fcal$ are known.

\paragraph{Unknown Approximation Errors} Notice that in the robustness results (\pref{thm:find_v_star_appx} and \pref{thm:main_appx}) we require the knowledge of approximation errors $\varepsilon_\Fcal$ or $\varepsilon_{\Fcal(\gapmin)}$ to set the threshold $\alpha$ in PABC (\pref{alg:pess_alg}). In \pref{app:lang} we show that a variant of PABC based on Lagrangians (PABC-L; \pref{alg:pess_lang_alg}) does not require such knowledge and still enjoys the same sample complexity guarantees. In PABC-L, the original constraints in \Cref{eq:constraint} are moved to the objective, so the threshold $\alpha$ is no longer needed as an input. We refer the reader to \pref{app:lang} for the formal description of PABC-L and its results and proofs. 

\paragraph{Relaxed Gap Assumption} An outstanding characteristic of \pref{thm:main_appx} is that it no longer depends on $\gap(Q^*)$ explicitly, and only depends on $\gapmin$, a parameter of our choice. Therefore, it may seem to have lifted \pref{assum:gap_plus} that $\gap(Q^*) > 0$, as we can choose $\gapmin$ to be sufficiently large. However, below we show that this issue is more complicated than it may seem, and while our result does relax \pref{assum:gap_plus} in significant ways, it does so in a very nuanced manner. 

First of all, in the worst-case scenario, \pref{assum:gap_plus} is still needed to provide non-vacuous guarantees. This is because, if $Q^*$ has no gap, yet we artificially create a large $\gapmin$ in our prescreened function class $\Fcal(\gapmin)$, we could eliminate all the good approximations of $Q^*$. Among the remaining functions, the best $\ell_\infty$ approximation of $Q^*$ must have an $\ell_\infty$ error no less than $\gapmin$, and if we plug that into the $\varepsilon_{\Fcal(\gapmin)}$ term in the approximation guarantee, the $\gapmin$ factors in the numerator and the denominator cancel, leaving a constant suboptimality gap which makes the guarantee vacuous. 
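
As a minimal illustration of this cancellation (the constant $c$ is hypothetical and introduced only for this sketch), suppose $\varepsilon_{\Fcal(\gapmin)}= c\,\gapmin$ for some $c>0$ that does not shrink with $\gapmin$. The guarantee of \pref{thm:main_appx} then reads
\begin{align*}
v^{\pi_{\hat f}}
&\ge v^*-\varepsilon-\frac{(H^2+H)\,c\,\gapmin+H^2\varepsilon_\Wcal}{\gapmin}\\
&= v^*-\varepsilon-(H^2+H)\,c-\frac{H^2\varepsilon_\Wcal}{\gapmin},
\end{align*}
so the suboptimality contains the term $(H^2+H)\,c$, which does not vanish as $\varepsilon\to 0$ or as more data are collected. Whenever $c\ge 1/(H+1)$, this term alone is at least $H$, the range of the return if per-step rewards lie in $[0,1]$ (an assumption of this sketch), and the bound is vacuous.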



Having said that, the nuance here is that we do \textit{not} use the most stringent $\ell_\infty$ norm to define $\varepsilon_{\Fcal(\gapmin)}$, but rather use an average notion of error (\Cref{eq:appx_f}), which is possibly much smaller than the $\ell_\infty$ error (\Cref{eq:compare_epsF}). Therefore, there are still cases where $\gap(Q^*) = 0$ yet our result yields nontrivial guarantees. As a concrete example, imagine a $Q^*$ that has large gaps in most states, but the gap is $0$ in a few ``bad'' states. In this case, $\gap(Q^*)$ is $0$. However, there can still exist $\tilde Q_{\Fcal(\gapmin)}^*$ that approximates $Q^*$ well everywhere except on those bad states, and as long as no $w\in\Wcal$ puts significant probabilities on the bad states, we  have $\varepsilon_{\Fcal(\gapmin)} \ll \gapmin$ and hence \pref{thm:main_appx} will provide meaningful guarantees.




\section{Handling the Unknown Gap Parameter with Online Access} \label{sec:unknown_gap}
In this section, we extend the main algorithm and analyses in \pref{sec:main} in a different direction than \pref{sec:appx_error}. In particular, we are concerned about the fact that \pref{thm:main} assumes the knowledge of $\gap(Q^*)$. While it is common for offline RL algorithms to have hyperparameters that need to be tuned separately (and this is particularly the case for version-space-based algorithms \citep{jiang2017contextual, xie2021bellman}), here we show that we can address the unknown $\gap(Q^*)$ issue by a small amount of additional online interactions for Monte-Carlo policy evaluation. This is particularly interesting as our result provides an example of how one can use a small amount of online interactions to mitigate limitations in purely offline learning, a practically relevant problem that is also of great interest to the RL theory community \citep{xie2021policy}. 


\begin{algorithm}[htb]
	\caption{PABC-OA (PABC with Online Access)	\label{alg:unknown_gap}}
	\begin{algorithmic}[1]
	    \STATE \textbf{Input}: function class $\Fcal$, weight function class $\Wcal$, and dataset $\Dcal$ (with size $|\Dcal_h|=n,\forall h\in[H]$).
		\FOR{$t=0,1,\ldots$} %\label{lin:forloop}
		\STATE Set $\mathrm{gap}^{\mathrm{guess}}_t=H/2^t$.
		\STATE Use $n$ and $\mathrm{gap}^{\mathrm{guess}}_t$ to calculate $\varepsilon_t=\sqrt{8C^2H^6\iota(t)/(n(\mathrm{gap}^{\mathrm{guess}}_t)^2)}$, where $\iota(t)=\log(24|\Fcal||\Wcal|H\cdot 2^t/\delta)$.
		\STATE Run \pref{alg:pess_alg} with $\alpha=\varepsilon_t/(2H)$ and obtain the scalar estimate $\hat v^*_t$. \label{line:find_v}
		\STATE Run \pref{alg:pess_alg} with $\alpha=\varepsilon_t\mathrm{gap}^{\mathrm{guess}}_t/(2H^2)$ and $\gapmin=\mathrm{gap}^{\mathrm{guess}}_t$, and obtain the policy $\hat \pi_t$. \label{line:main}
		\STATE Estimate $v^{\hat \pi_t}$ by Monte-Carlo policy evaluation with $\tilde O(H^3\log(1/\delta)/\varepsilon_t^2)$ online samples and denote the estimate as $\hat v^{\hat \pi_t}$. \label{line:mc}
		\IF {$\hat v^{\hat \pi_t}\ge \hat v^*_t- 3\varepsilon_t$} \label{line:cond}
		\STATE Output $\hat \pi_t$ and terminate. \label{line:check}
		\ENDIF
		\ENDFOR
	\end{algorithmic}
\end{algorithm}%PABC-OA (PABC with Online Access)
As shown in \pref{alg:unknown_gap}, the algorithm PABC-OA (PABC with Online Access) proceeds iteration by iteration. We start with the maximum possible value of the unknown $\gapq$. For simplicity, we choose $H$ here; alternatively, we can use $\max_{f\in\Fcal} \gap(f)$, which is tighter. In iteration $t$, we use $\mathrm{gap}^\mathrm{guess}_t=H/2^t$ as the guess of $\gapq$. We calculate the desired $\alpha$ according to \pref{thm:find_v_star} to estimate $v^*$ (\pref{line:find_v}), and calculate the desired $\alpha$ and $\gapmin$ according to \pref{thm:main} to find a candidate near-optimal policy (\pref{line:main}). Finally, we conduct Monte-Carlo policy evaluation with online samples (\pref{line:mc}). If the stopping condition (\pref{line:cond}) is satisfied, we are guaranteed to have learned a near-optimal policy and can terminate (\pref{line:check}). Otherwise, we proceed to the next iteration, halve the guess $\mathrm{gap}^\mathrm{guess}_t$, and continue the routine. This reveals a connection between \pref{thm:find_v_star} and \pref{thm:main}: the estimate of $v^*$ obtained via the former serves as the benchmark in the stopping condition for the policy learned via the latter, so identifying $v^*$ is indeed useful.

It can be shown that \pref{alg:unknown_gap} will terminate once the guessed value $\mathrm{gap}^{\mathrm{guess}}_t=H/2^t$ drops below the true value $\gapq$, which leads to the sample complexity result in \pref{thm:main_unknown}. The formal proof can be found in \pref{app:proof_main_unknown_gap}. 


\begin{theorem}[Sample complexity of learning a near-optimal policy with unknown $\gapq$]
\label{thm:main_unknown}
Suppose Assumptions \ref{assum:realizablity_q}, \ref{assum:realizablity_w}, \ref{assum:bound_q}, \ref{assum:bound_w}, \ref{assum:gap_plus} hold but $\gapq$ is unknown. Assume we have a dataset $\Dcal$ with size $n$ for each $\Dcal_h$ and additional online access to collect \[\tilde O\rbr{\frac{n\log(1/\delta)}{C^2H}}\] samples. 
Then with probability at least $1-\delta$, the output policy $\hat \pi$ from \pref{alg:unknown_gap} satisfies 
\begin{align}
\label{eq:unknonw_q_accu}
v^{\hat \pi}\ge v^* - 5\sqrt{\frac{32C^2H^6\iota(\log(2H/\gapq))}{n\gapq^2}},    
\end{align}
where $\iota(t)=\log(24|\Fcal||\Wcal|H\cdot 2^t/\delta)$.
\end{theorem}
%\begin{remark}
The suboptimality in \Cref{eq:unknonw_q_accu} has the same order (up to polylog terms) as that of running \pref{alg:pess_alg} with known $\gapq$ in \pref{thm:main}. 
If we set this value to be $\varepsilon'$, i.e., $\varepsilon':=5\sqrt{\frac{32C^2H^6\iota(\log(2H/\gapq))}{n\gapq^2}}$, then the number of required online samples is $\tilde O\rbr{\frac{H^5\log(1/\delta)}{(\varepsilon'\gapq)^2}}$, which does not depend on the complexity of the function classes $\Fcal$ and $\Wcal$. 
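
The constants in \Cref{eq:unknonw_q_accu} can be traced back to the halving schedule. The following sketch assumes that $\gapq\le H$, that on the high-probability event both $|\hat v^*_t-v^*|\le\varepsilon_t$ and $|\hat v^{\hat\pi_t}-v^{\hat\pi_t}|\le\varepsilon_t$ hold in every iteration, and that the logarithm in the argument of $\iota$ is base $2$; the formal argument is in \pref{app:proof_main_unknown_gap}. Let $t^\star$ be the first iteration with $\mathrm{gap}^{\mathrm{guess}}_{t^\star}=H/2^{t^\star}\le\gapq$. Since the guess is halved in each iteration,
\begin{align*}
\mathrm{gap}^{\mathrm{guess}}_{t^\star}\ge \frac{\gapq}{2}
\quad\text{and}\quad
t^\star\le \log_2\frac{2H}{\gapq},
\end{align*}
and therefore, for every $t\le t^\star$,
\begin{align*}
\varepsilon_t\le\varepsilon_{t^\star}
&=\sqrt{\frac{8C^2H^6\iota(t^\star)}{n\,(\mathrm{gap}^{\mathrm{guess}}_{t^\star})^2}}\\
&\le\sqrt{\frac{32C^2H^6\iota(\log_2(2H/\gapq))}{n\,\gapq^2}}.
\end{align*}
If the algorithm stops at iteration $t$, the stopping condition (\pref{line:cond}) and the two estimation errors give
\begin{align*}
v^{\hat\pi_t}\ge \hat v^{\hat\pi_t}-\varepsilon_t
\ge \hat v^*_t-4\varepsilon_t
\ge v^*-5\varepsilon_t,
\end{align*}
which accounts for the factor $5$.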




\section{Discussion and Conclusion} \label{sec:discuss}
We conclude the paper with a detailed discussion of how our work compares to the closely related concurrent work of \citet{zhan2022offline}, which also provides a good summary of our contributions and promising future directions. 


The very recent work of \citet{zhan2022offline} aims at solving the same problem:\footnote{Their results are in the discounted setting whereas ours are in the finite-horizon setting, but this is a superficial difference and translating each of the results into the other setting is not difficult.} offline RL under only single-policy coverage and realizability assumptions. Similar to our counterexample in \pref{sec:counter}, they also recognize the difficulties in the setting where the optimal weight and value functions are realizable in a straightforward manner. Instead of making a gap assumption like we do, they attack the problem from a different angle by introducing regularization into the Lagrangian of the linear program for MDPs. 

Although the two approaches have some fundamental differences (which we elaborate on further below), it is still worth comparing the nature of the two results. To this end, our approach has several advantages:
\begin{enumerate}[leftmargin=*]
\item Regularization changes the definition of the value function in \citet{zhan2022offline}. In fact, the function they need to realize does not obey any form of Bellman equations, and probably should not be called value functions anymore. This makes their realizability assumption somewhat difficult to interpret and connect to the existing literature. In contrast, we work with the most standard notion of $Q^*$. 
\item Due to regularization, the policy learned by \citet{zhan2022offline} is generally suboptimal even with infinite data, so the strength of regularization needs to be carefully controlled for the bias-variance trade-off. As a result, when competing with $\pi^*$, their sample complexity rate is $O(1/\varepsilon^6)$, which is much slower than our $O(1/\varepsilon^2)$. 
\item Our coverage assumption can be significantly relaxed using the structure of $\Fcal$; see discussion in \pref{sec:appx_error}.  
While this is standard in recent offline RL works based on Bellman-completeness assumptions \citep{jin2021pessimism, xie2021bellman}, \citet{zhan2022offline}'s guarantee relies on the boundedness of the raw density ratios and does not enjoy such a relaxation.  
\end{enumerate}
That said, \citet{zhan2022offline}'s result is also attractive in several aspects:
\begin{enumerate}[leftmargin=*]
\item They do not require gap assumptions. While similar gap assumptions are standard in the RL theory literature, it is unclear how prevalent they are in real problems and how algorithms that depend on gap assumptions perform when the assumption is violated.
\item Our guarantees only hold if the data covers $\pi^*$ (though the notion of coverage can be relaxed using the structure of $\Fcal$, as mentioned above). In comparison, \citet{zhan2022offline} can still provide meaningful guarantees even when $\pi^*$ is not covered by data, in which case they compete with the best policy under data coverage. 
\item Regarding computation, their algorithm is a convex-concave minimax optimization problem when the function classes are convex. In comparison, the computational characteristics of our method are less clear, though we note that a Lagrangian form of our main step (\pref{line:pess_select}) (see \pref{app:lang} for details) is similar to the kind of minimax optimization commonly found in the MIS literature \citep{nachum2019dualdice,uehara2020minimax, yang2020off, jiang2020minimax}. 
\end{enumerate}
 
We reiterate that these comparisons are made only on the results themselves. The two works take fundamentally different approaches and are of independent interest. For example, although both works use density-ratio functions, \citet{zhan2022offline}'s method is based on the linear programming (LP) formulation of MDPs where the optimal state-value function $V^*$ is modeled, whereas we model the optimal Q-function $Q^*$. This difference is more significant than it may seem, as the LP formulation and the Bellman optimality equations for $Q^*$ are very different foundations for designing learning algorithms, and the gap assumption only makes sense for Q-functions and cannot be used with state-value functions. That said, it will be interesting to investigate whether the two works can borrow each other's ideas to address their own weaknesses, which we leave to future investigation. 


\begin{acknowledgements} 
The authors thank Akshay Krishnamurthy for helpful discussions. NJ acknowledges funding support from ARL Cooperative Agreement W911NF-17-2-0196, NSF IIS-2112471, NSF CAREER award, and Adobe Data Science Research Award. 
\end{acknowledgements}


%\clearpage

\bibliography{refs}

%\input{appendix}

\end{document}
