% \section{Example of Model}
\onecolumn

We first briefly describe the structure of the Appendix. In Appendix~\ref{appendix:example} we add two more examples in the multi-step setting, supplementing the example in Section~\ref{sec:overfitting}. In Appendix~\ref{appendix:theory} we provide the proofs of the theorems in Section~\ref{sec:overfitting_theory}. In Appendix~\ref{appendix:experiment-details} we include more experiment details. In Appendices~\ref{ap:more-exp-tumor},~\ref{ap:more-exp-sepsis},~\ref{ap:more-exp-cartpole} and~\ref{ap:more-exp-d4rl} we include more results in the considered domains, including experiments that estimate the behavior policy with function approximation, experiments with an alternative policy selection procedure based on the best intermittent policy checkpoint, and experiments on the D4RL dataset. For the real-world dataset on ICU sepsis treatment, we also include in Appendix~\ref{ap:sepsis-ablation} an ablation study without ESS constraints for hyperparameter selection on the validation set, and in Appendix~\ref{sec:effect_delta} an investigation of the effect of the eligible action constraint $\delta$. In Appendix~\ref{sec:is_low} we investigate the weight given by different methods to states with low observed outcomes, and in Appendix~\ref{ap:sepsis-tradeoff} we compare the methods in terms of ESS and performance. Finally, in Appendix~\ref{ap:sepsis-action-viz} we include visualizations of eligible actions for high/mid/low-SOFA patients, in addition to a timestep-by-timestep visualization of the two action constraints considered in this paper (based on the eligible action set in \ispg and based on the probability under the behavior policy for other methods).


\section{Counter Examples in RL Settings}
\label{appendix:example}
In the main text, we gave an example of the overfitting issue in contextual bandits with large state and action spaces and small datasets. The next two examples show that overfitting occurs even more easily in sequential reinforcement learning settings, even when only 2 actions are available, with or without state aliasing.
\begin{example}
\label{example:rl}
Consider a sequential treatment problem as shown in Figure~\ref{fig:overfitting_example}. There are two actions available in each state. From the first state, action $a_1$ has a 50\% chance of leading to an immediate terminal positive reward $r=1$ and a 50\% chance of leading to an immediate terminal negative reward $r=-1$. From the first state, action $a_2$ also has a 50\% chance of leading to an immediate terminal positive reward $r=1$. With the remaining 50\% probability, action $a_2$ results in transitions to additional states, which are followed by additional actions, for another $H-1$ steps; however, all of these transitions eventually end in a large negative outcome (e.g., $r=-5$). For example, one could consider a risky surgical procedure that results in many subsequent additional operations but is ultimately typically unsuccessful. Assume the behavior policy is uniform over actions, yielding $\mu(a_1|x_0) = \mu(a_2|x_0) = 0.5$ and a probability of $\frac{1}{|A|^{H-1}}$ for each action sequence following $a_2$. With even minimal data, the value of $\pi(x_0)=a_1$ will be accurately estimated as 0. However, when $H$ is large relative to a function of the dataset size, there always exists an action sequence after an initial selection of $a_2$ that is not observed in the dataset. This means that a policy $\pi_2$ that starts with $\pi(x_0)=a_2$ and then selects an unobserved action sequence will essentially put 0 weight on the resulting contexts that incur $r=-5$ outcomes, even though such outcomes occur 50\% of the time after taking action $a_2$. In this case, the value of $\pi_2$ will be overestimated significantly by IS or self-normalized IS. Thus offline policy optimization will prefer taking action $a_2$ at the first step as a result of overfitting, even though the true value of first taking $a_2$ is $0.5 \cdot 1 + 0.5 \cdot (-5) = -2$ and the optimal policy value is $0$, obtained by taking action $a_1$.
%We need to make the decision between two actions in the first step. Both actions have a positive treatment effect on half of the patients leading to $r=1$. Action 1 has a side effect ($r=-1$) on the other half of the patients. Action 2 has a stronger side effect ($r=-5$) on the other half of the patients, however, not immediately observed. These patients will be treated with $H-1$ more steps with $A$ actions, leading to $A^{H-1}$ different action sequences. In this example, data are draw from an uniform random policy $\mu(a=0|x_1) = \mu(a=1|x_1)$ = 0.5, and the probability of each action sequence following $a=2$ is $\frac{1}{A^{H-1}}$. When $A$ and $H$ are large, there always exists a sequence that is not observed in the dataset. In this case, the policy taking action 2 at the first step and the unseen path in the following steps will be overestimated significantly by IS or self-normalized IS. Thus the offline policy optimization will prefer taking action 2 at the first step as a result of overfitting. 
\end{example}
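
To make the dependence on $H$ and the size of the resulting bias concrete, the following back-of-the-envelope calculation is a sketch under assumptions not stated in Example~\ref{example:rl}: the dataset has $n$ trajectories, $|A|=2$, truncation is inactive, and the empirical frequencies match their expectations. The symbols $n$, $\hat{v}_{\mathrm{IS}}$ and $\hat{v}_{\mathrm{SNIS}}$ are introduced only for this illustration. There are $2^{H-1}$ action sequences following $a_2$, and the dataset contains at most $n$ distinct ones, so an unobserved sequence is guaranteed to exist whenever
\begin{align*}
2^{H-1} > n \quad \Longleftrightarrow \quad H > 1 + \log_2 n .
\end{align*}
For a deterministic policy $\pi_2$ that takes $a_2$ and then follows such an unobserved sequence, the only trajectories with nonzero weight are those that take $a_2$ and terminate immediately with $r=1$. These make up roughly $n/4$ of the data, and each has weight $1/0.5 = 2$ under the uniform behavior policy, so
\begin{align*}
\hat{v}_{\mathrm{IS}} \approx \frac{1}{n} \cdot \frac{n}{4} \cdot 2 \cdot 1 = 0.5, \qquad \hat{v}_{\mathrm{SNIS}} \approx \frac{(n/4) \cdot 2 \cdot 1}{(n/4) \cdot 2} = 1 .
\end{align*}
Both estimates exceed the estimated value $0$ of taking $a_1$, although the true value of $\pi_2$ is negative.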

%Again we see the policy tries to avoid the logged actions on the sick patients, such that these patients contribute no weights in the weighted return policy evaluation. However, as a result of the previous action, these sick patients are not avoidable in the true environment. Thus this way of optimizing the weights is an overfitting in the offline dataset. 

We now slightly change the transitions shown in Figure~\ref{fig:overfitting_example} and show that model-based and value-based approaches also fail.

\begin{example}
\label{example:rl_model_fail}
In this example, we add a third action in the first step. Actions $1$ and $3$ lead to the same next state. In that next state, however, the reward depends on the action taken in the previous step, regardless of the action taken in the current one: if $a_1 = 1$, the reward is the same as for action $1$ in the example in Figure~\ref{fig:overfitting_example}; if $a_1 = 3$, the reward is $-5$. Model-based and value-based methods therefore mix the rewards for $a_1 = 1$ and $a_1 = 3$ and fail in this example. Other methods are not affected by the additional structure, since it only adds an action with the minimum reward.
\end{example}
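
As a rough illustration of why pooling fails here, consider the following sketch. It assumes, beyond what Example~\ref{example:rl_model_fail} states, that the data contain equally many trajectories starting with action $1$ and with action $3$, and that the learned reward model conditions only on the shared next state, denoted $x'$ here; $\hat{r}$ is a hypothetical name for its estimate. Action $1$ yields $r=1$ or $r=-1$ with equal probability, so its mean reward is $0$, while action $3$ yields $-5$. The pooled estimate is then
\begin{align*}
\hat{r}(x') \approx \frac{1}{2} \cdot 0 + \frac{1}{2} \cdot (-5) = -2.5 ,
\end{align*}
which is assigned to both action $1$ and action $3$. Action $1$, whose true value is $0$, is therefore undervalued by $2.5$ and is ranked below action $2$ whenever the estimated value of action $2$ exceeds $-2.5$.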

\begin{figure*}[th]%{R}{0.8\textwidth}
%\vspace{-\intextsep}
%\hspace*{0.5cm}
    \begin{minipage}{0.33\textwidth}
    \centering
    \includegraphics[width=0.9\textwidth]{images/example_patient.pdf}
    \subcaption{Example \ref{example:rl}.}
    \label{fig:overfitting_example}
    \end{minipage}
    \begin{minipage}{0.33\textwidth}
    \centering
    \includegraphics[width=0.9\textwidth]{images/example_patient_model_fail.pdf}
    \subcaption{Example \ref{example:rl_model_fail}.}
    \label{fig:example_patient_model_fail}
    \end{minipage}
    \begin{minipage}{0.33\textwidth}
    \centering
    \includegraphics[width=0.9\textwidth]{images/toy_example_result.png}
    \subcaption{Number of samples to solve Example 2/3 for $A=2$.}
    \label{fig:overfitting_example_result}
    \end{minipage}
\end{figure*}
% \begin{figure}[ht]
%     \centering
%     \includegraphics[width=0.5\textwidth]{example_patient_model_fail.pdf}
%     \caption{A non-Markov variant of example in Figure \ref{fig:overfitting_example}}
%     \label{fig:example_patient_model_fail}
% \end{figure}

\clearpage
\section{Proofs of Section \ref{sec:overfitting_theory} }
\label{appendix:theory}
Proof of Theorem \ref{thm:weights_lowerbound}.
\begin{proof}
\begin{align}
    \sum_{ x_{h}^{(j)} \in \Bcal(\ith{x_h},\threshold) } \frac{\pi(a_{h}^{(j)}|x_{h}^{(j)})}{\mu(a_{h}^{(j)}|x_{h}^{(j)})} \ge& \sum_{ x_{h}^{(j)} \in \Bcal(\ith{x_h},\threshold) } \pi(a_{h}^{(j)}|x_{h}^{(j)}) \\
    \ge&  \sum_{ x_{h}^{(j)} \in \Bcal(\ith{x_h},\threshold) } \max \{0, \pi(a_{h}^{(j)}|x_{h}^{(i)}) - \threshold L \} \\
    \ge& \sum_{a \in \Aset_h(\ith{x_h};\Dcal,\threshold)} \max \{0, \pi(a|x_{h}^{(i)}) - \threshold L \} \ge 1 - \threshold L |\Acal|
\end{align}
\end{proof}
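
The inequalities above can be read as follows. This is a sketch of the reasoning under assumptions inferred from the notation rather than restated from Theorem~\ref{thm:weights_lowerbound}: $L$ is taken to be a Lipschitz constant of $\pi(a|\cdot)$ with respect to $\dist$, every context in $\Bcal(\ith{x_h},\threshold)$ is within distance $\threshold$ of $\ith{x_h}$, and $\pi(\cdot|\ith{x_h})$ is supported on $\Aset_h(\ith{x_h};\Dcal,\threshold)$. The first line uses $\mu(a|x) \le 1$. The second line uses the Lipschitz condition together with $\pi \ge 0$:
\begin{align*}
\pi(a|x_h^{(j)}) \ge \pi(a|x_h^{(i)}) - L \, \dist(x_h^{(j)}, x_h^{(i)}) \ge \pi(a|x_h^{(i)}) - \threshold L .
\end{align*}
The third line keeps one nonnegative term for each eligible action, each of which is observed at least once in the ball. Finally, dropping the maximum and summing over the eligible actions gives
\begin{align*}
\sum_{a \in \Aset_h(\ith{x_h};\Dcal,\threshold)} \left( \pi(a|x_h^{(i)}) - \threshold L \right) = 1 - \threshold L \, |\Aset_h(\ith{x_h};\Dcal,\threshold)| \ge 1 - \threshold L |\Acal| .
\end{align*}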

Proof of Corollary \ref{cor:onestep_weights_lowerbound}.
\begin{proof}
\begin{align}
     \sum_{x_{1}^{(j)} \in \Bcal(\ith{x_1},\threshold)} \frac{\min\{W^{(j)},M \}}{
     \sum_{k=1}^n \min\{ W^{(k)},M \} } &\ge \sum_{x_{1}^{(j)} \in \Bcal(\ith{x_1},\threshold)} \frac{\min\{W^{(j)},M \}}{
     nM } \\
     &\ge \frac{\min\{\sum_{x_{1}^{(j)} \in \Bcal(\ith{x_1},\threshold)} W^{(j)},M \}}{
     nM }\\ 
     &\ge \frac{1-\threshold L |\Acal|}{nM}
\end{align}
\end{proof}

Proof of Proposition \ref{prop:nstep_necessity}.
\begin{proof}
This holds because $\pi(a|\ith{x_h})$ and $\mu(a|\ith{x_h})$ do not depend on the history given $\ith{x_h}$. Hence $W_{1:h}^{(i)}$ and $\ith{W_h}$ are conditionally independent given $x_{h}^{(i)}$.
\end{proof}

Proof of Theorem \ref{thm:consistV}.
\begin{proof}
Let $\Prob_h(x;\mu)$ be the distribution of the context at the $h$-th step under roll-in policy $\mu$. For any fixed $a$, we can define the distribution $\Prob_h(x|a;\mu) = \mu(a|x)\Prob_h(x;\mu)/\sum_{x'} \mu(a|x')\Prob_h(x';\mu)$. For $a$ such that $\mu(a|x) > 0$, $\Prob_h(x|a;\mu)$ is also greater than zero. All $\ith{x_h}$ with $\ith{a_h} = a$ are i.i.d. samples drawn from the distribution $\Prob_h(x|a;\mu)$. By the property of the nearest neighbor \citep{cover1967nearest}, with probability 1, as $n \to \infty$:
\[
\min_{\ith{x_h} \text{ s.t. } \ith{a_h} = a} \dist(x,\ith{x_h}) \to 0 < \threshold.
\]
That means that, with probability $1$, $a \in \Aset_h(x;\Dcal,\threshold)$ for all $a$ such that $\mu(a|x) > 0$. Thus we have proved the theorem statement and that the policy class will contain all $\pi$ such that $\pi(a|x) > 0$ only if $\mu(a|x) > 0$.
%Now we construct a set of context $\{x_h^{(j)} s.t.(x_h^{(j)}, a) \in \Dcal \}$
\end{proof}

Proof of Theorem \ref{thm:consistency}.
\begin{proof}
Given the overlap assumption and Theorem \ref{thm:consistV}, for all $\pi$ we have $a \in \Aset_h(x;\Dcal,\threshold)$ for all $a$ such that $\pi(a|x) > 0$ with probability 1. Thus the solution to Equation \ref{eq:poela_objective} is the same as $\argmax_{\pi} J(\pi,\Dcal) := \hat{\pi}_{J,\Dcal}$.

By the condition that $M \to \infty$ and $\frac{M}{n} \to 0$ as $n \to \infty$, we have that the truncated IS estimator is mean square consistent \citep{ionides2008truncated}: %\sum_{i=1}^n \min\left\{   \prod_{h=1}^{H} \ith{W_h}, M \right\}
\begin{align}
    \frac{1}{ n } \sum_{i=1}^n \left( \sum_{h=1}^H \ith{r_h} \right) \min\left\{   \prod_{h=1}^{H} \ith{W_h}, M \right\}  \xrightarrow[]{q.m.} v^\pi, 
\end{align}
as $n \to \infty$. Similarly, the mean of the weights converges to $1$ in quadratic mean:
\begin{align}
   \frac{1}{n}\sum_{i=1}^n \min\left\{   \prod_{h=1}^{H} \ith{W_h}, M \right\} \xrightarrow[]{q.m.} 1.
\end{align}
By the continuous mapping theorem, the self-normalized truncated IS estimator converges to $v^\pi$ in probability: $ \wtis \xrightarrow[]{p} v^\pi $.
The empirical variance penalty also converges to $0$ almost surely, since $M/n$ converges to $0$:
\begin{align}
    \frac{\sum_{i=1}^n \left(\ith{r} - \wtis \right)^2 (\min\{\ith{W},M \})^2 }{\left( \sum_{i=1}^n  \min\{\ith{W},M \}\right)^2} \le \frac{M^2}{\left( \sum_{i=1}^n  \min\{\ith{W},M \}\right)^2} \xrightarrow[]{q.m.} 0.
\end{align}
Thus the objective function $J(\pi;\Dcal)$ converges to $v^\pi$ in probability:
\begin{align}
    \Pr \left( |J(\pi;\Dcal) - v^\pi| > \epsilon \right) = \delta_n \to 0.
\end{align}
Since we assume $|\Pi|<\infty$, a union bound gives
\begin{align}
    \Pr \left( \exists \pi \in \Pi : |J(\pi;\Dcal) - v^\pi| > \epsilon \right) \le |\Pi|\delta_n. 
\end{align}
So with probability at least $1 - |\Pi|\delta_n $, for any $\epsilon$:
\begin{align}
    v^{\hat{\pi}_{J,\Dcal}} &\ge J(\hat{\pi}_{J,\Dcal}, \Dcal) - \epsilon \\
    &\ge J(\pi^{\star}, \Dcal) - \epsilon \\
    &\ge v^{\pi^{\star}} - 2\epsilon, 
\end{align}
where $\pi^\star$ is $\argmax_{\pi \in \Pi} v^\pi $. As $|\Pi|\delta_n \to 0$, we have proved that the true value of the empirical maximizer $v^{\hat{\pi}_{J,\Dcal}}$ converges to the maximum value $\max_{\pi \in \Pi} v^\pi$ in probability.
\end{proof}

\section{Experiment Details}
\label{appendix:experiment-details}
\textbf{For all experiments in the main text}, we report the test performance of the policy saved at the end of training, either through online Monte-Carlo estimation if a simulator is available or using SNTIS estimates on a held-out test set.

\textbf{For all experiments reported in Appendices~\ref{ap:tumor-checkpoint} and~\ref{ap:sepsis-checkpoint}}, we follow the three-phase pipeline described below to determine the test scores reported in the corresponding tables. This mirrors the more realistic situation of real-world applications, where practitioners select a policy from regular checkpoints saved during training on the basis of its self-normalized truncated IS (SNTIS) score on the validation set. Each algorithm is trained on the training set multiple times, using different hyperparameters and several restarts. Intermittent policies generated during training are saved at checkpoints, and those with the highest SNTIS estimates on a held-out validation set are selected for the final test evaluation. The pipeline is illustrated in Figure~\ref{fig:exp_flow}.

\begin{figure}[ht]
    \centering
    \includegraphics[width=0.9\textwidth]{images/exp_flow.png}
    \caption{The hyperparameter search and testing process used in the experiments.}
    \label{fig:exp_flow}
\end{figure}

The open-source code for \ispg can be found here: \href{https://github.com/StanfordAI4HI/poela}{https://github.com/StanfordAI4HI/poela}.

\subsection{Experiment Details in TGI Simulator}
The TGI simulator describes the growth kinetics of low-grade gliomas (LGG) in response to chemotherapy over a horizon of 30 months using an ordinary differential equation (ODE) model. The ODE parameters are estimated from data on adult diffuse LGG collected during and after chemotherapy. The goal in this environment is to achieve a reduction in mean tumor diameter (MTD) while reducing the drug dosage \citep{yauney2018reinforcement}. We include the MTD, the drug concentration, and the number of months (time step) in the context space. Note that this context space is non-Markov, as it does not include all parameters of the ODEs. Actions are binary, representing taking the full dose or no dose, the same as in prior work \citep{yauney2018reinforcement}. The reward at each step consists of an immediate penalty proportional to the drug concentration, and a delayed reward at the end that measures the decrease in MTD compared with the beginning. In each episode, the parameters, including the initial MTD, are sampled from a log-normal distribution following \citet{ribba2012tumor}, representing differences between individuals. The behavior policy is a fixed dosing schedule of 9 months (the median duration from \cite{peyre2010prolonged}) mixed with a uniformly random choice of action with probability $30\%$. We run all algorithms on a training set of 1000 episodes with different hyperparameters (listed below) and 5 restarts, saving checkpoints during training.% Then we select the best policy for each algorithm by $\wtis$ on the validation set with 1000 episodes as well.
The validation set also comprises 1000 episodes. 

\paragraph{Hyperparameters.} The first part of Table~\ref{tab:hp_tumor} lists the hyperparameters searched for each algorithm. The exception is the parameter $b$ in PQL, which is set adaptively to the $2$nd percentile of the score on the training set, as in the original paper \citep{liu2020provably}. Since the behavior policy is known, we use the true behavior policy in the BCQ and PQL algorithms. Because the behavior policy is $\epsilon$-deterministic, its action probabilities take only two distinct values, so the BCQ threshold takes only two values. The second part of Table~\ref{tab:hp_tumor} specifies the hyperparameters and settings that are fixed for all algorithms. All policy and Q-functions are approximated by fully-connected neural networks with two hidden layers of 32 units each.

\begin{table}[ht]
    \centering
    \begin{tabular}{c|cc}
    \toprule
       Hyperparameters  & used in algorithms & values  \\
       \midrule
       $\delta$ & \ispg & $0.05, 0.1, 0.5$ \\
        $\hat{\mu}$ threshold & \ispgknn & 0.01, 0.05, 0.1, 0.2 \\
       CRM Var coefficient & \ispg, \ispgbaseline & $0, 0.1, 1$ \\
       BCQ threshold & BCQ, PQL & $0.0, 0.2$ \\ 
       \midrule
       $M$ in $\wtis$ & \text{All} & $1000$ \\
       Max training steps & \ispg, \ispgbaseline, \ispgknn & 500 \\
       & BCQ, PQL & 1000 \\
       Number of checkpoints & All & 50 \\
       Batch size & BCQ, PQL & 100 \\
       \bottomrule
    \end{tabular}
    \vspace{0.5cm}
    \caption{Hyperparameters in the TGI simulator experiment}
    \label{tab:hp_tumor}
\end{table}
The maximum number of update steps and the checkpoint frequency differ because BCQ and PQL are updated by stochastic gradient descent, whereas all policy optimization methods based on SNTIS use gradient descent.

% A single training and validation run of an algorithm with takes no more than half an hours on Intel Xeon CPU with 2.40 GHz.

\subsection{Experiment Details in the MIMIC III Dataset}
\label{ap:sepsis-details}
The MIMIC III sepsis dataset is available upon application and training: \url{https://mimic.mit.edu/iii/gettingstarted/}. The code to extract the cohort is available at \url{https://gitlab.doc.ic.ac.uk/AIClinician/AIClinician}. This cohort consists of data for 14971 patients. The contexts for each patient consist of 44 features, 
%including demographics, Elixhauser premorbid status, vital signs, laboratory values,
summarized in 4-hour intervals, for at most 20 steps. The actions we consider are the prescription of IV fluids and vasopressors. Each of the two treatments is binned into 5 discrete actions according to the dosage amount, resulting in 25 possible actions. The reward is defined from the 90-day mortality in the logs: $100$ if the patient survives and $0$ otherwise.

We now provide details of the experiment on the MIMIC III sepsis dataset. We run all algorithms on a training set of 8982 trajectories with different hyperparameters (listed below) and 3 restarts, saving checkpoints during training.% Then we select the best policy for each algorithm by $\wtis$ on the validation set with 2994 trajectories.
The validation set comprises 2994 trajectories. Finally, we compute the $\wtis$ evaluation on the test set of 2995 trajectories. The first part of Table~\ref{tab:hp_sepsis} lists the hyperparameters searched on the validation set for each algorithm. The exception is the parameter $b$ in PQL, which is set adaptively to the $2$nd percentile of the score on the training set, as in the original paper~\citep{liu2020provably}. The second part of Table~\ref{tab:hp_sepsis} specifies the hyperparameters and settings that are fixed for all algorithms. All policy and Q-functions are approximated by fully-connected neural networks with two hidden layers of 256 units each.

\begin{table}[ht]
    \centering
    \begin{tabular}{c|cc}
    \toprule
       Hyperparameters  & used in algorithms & values  \\
       \midrule
       $\delta$ & \ispg & $0.4, 0.6, 0.8, 1.0$ \\
       $\hat{\mu}$ threshold & \ispgknn & 0.01, 0.02, 0.05, 0.1 \\
       CRM Var coefficient & \ispg, \ispgbaseline & $0, 0.1, 1, 10$ \\
       BCQ threshold & BCQ, PQL & $0.0, 0.01, 0.05, 0.1, 0.3, 0.5$ \\ 
       \midrule
       $M$ in $\wtis$ & \text{All} & $1000$ \\
       Max training steps & \ispg, \ispgbaseline, \ispgknn & 1000 \\
       & BCQ, PQL & 10000 \\
       Number of checkpoints & All & 100  \\
       Batch size & BCQ, PQL & 100 \\
       \bottomrule
    \end{tabular}
    \vspace{0.5cm}
    \caption{Hyperparameters in the MIMIC III sepsis experiment}
    \label{tab:hp_sepsis}
\end{table}

As explained above, the maximum number of update steps and the checkpoint frequency differ because BCQ and PQL are updated by stochastic gradient descent, whereas all policy optimization methods based on SNTIS use gradient descent.
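
For concreteness, the checkpoint spacing implied by Tables~\ref{tab:hp_tumor} and~\ref{tab:hp_sepsis} can be computed as below. This sketch assumes that checkpoints are evenly spaced over training, which the tables do not state explicitly; each entry is the maximum number of training steps divided by the number of checkpoints.
\begin{align*}
\text{TGI simulator:} &\quad 500/50 = 10 \ \text{(SNTIS-based methods)}, \quad 1000/50 = 20 \ \text{(BCQ, PQL)}, \\
\text{MIMIC III:} &\quad 1000/100 = 10 \ \text{(SNTIS-based methods)}, \quad 10000/100 = 100 \ \text{(BCQ, PQL)}.
\end{align*}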

% A single training and validation run of an algorithm with takes no more than 2 hours on Intel Xeon CPU with 2.40 GHz.

\subsection{Experiment Details for the Behavior Policy Estimation}
\label{ap:bc-details}
In the implementation of BC, we use multi-layer perceptrons (MLPs) with layer dimensions [32, 32, 32] for the LGG Tumor Growth Inhibition simulator and [256, 256, 256] for the MIMIC III dataset, all with ReLU activations. For BCRNN, we use 3-layer GRUs with an RNN hidden dimension of 100. All networks are trained using the Adam optimizer~\citep{kingma2014adam} with learning rate $3 \times 10^{-4}$. For all experiments, BC and BCRNN are trained for 500 steps and directly serve as the estimated behavior policies.


\subsection{Importance weights in low-reward trajectories}
\label{sec:is_low}
To examine if the proposed overfitting phenomenon exists in real experimental datasets, we compute the importance weights of the learned policy on the low-reward  trajectories in the training data for our MIMIC III dataset and our tumor simulator. Our hypothesis is that overfitting of the importance weights in policy gradient methods may result in the algorithm avoiding initial states with low rewards, which motivated our proposed algorithm. 
%We show this on the training set to reflect our hypothesis of overfitting importance weights in policy gradient methods. 

%For the completeness of this results and ablation study reason, we also shows this for the batch Q learning baselines (BCQ and PQL). Though the Q learning methods might not overfit the importance weights. 

In the MIMIC III dataset, the reward for a trajectory is either $0$ or $100$. We define the low-reward trajectories as those with $0$ reward. Low-reward trajectories make up over $60\%$ of all trajectories in the dataset. In the tumor simulation experiment, we define a low-reward trajectory as one whose reward is less than $-2$. Over $95\%$ of the trajectories in the tumor simulation dataset are low-reward trajectories. 

%The hyperparameters for each algorithm is the same as the selected one in the main experimental section. %**EB: we are only evaluating the learned policy on these states so I think the hyperparameters should be the same by default

Table~\ref{tab:training_weights} shows, for each algorithm and setting, the sum of the SNTIS weights of the learned policy on the low-reward trajectories of the training set. Our primary interest is to illustrate that alternative policy gradient methods that are also suitable for non-Markov domains can exhibit the importance sampling overfitting of avoiding low-reward trajectories. Indeed, POELA places a much larger weight on low-reward trajectories than the alternative offline policy search methods: 

\begin{table}[tbh]
    \centering
    \begin{tabular}{c|ccc}
        \toprule 
         Method & \ispg& \ispgknn & \ispgbaseline  \\
         \midrule 
        MIMIC III & 0.028 & 0.001 & 0.003 \\%& 0.148 & 0.149 \\
        Tumor non-MDP & 0.054 & - (fixed policy) & 0.005 \\ %& 0.0003 & 0.005 \\
        %Tumor MDP & 0.097 & - (fixed policy) & 0.0004 & 0.083 & 0.124 \\
         \bottomrule
    \end{tabular}
    \vspace{0.1cm}
    \caption{Importance weights overfitting: sum of SNTIS weights of the learned policy on low-reward trajectories in the training set.}
    \label{tab:training_weights}
\end{table} 
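
The quantity reported in Table~\ref{tab:training_weights} can be written explicitly as follows. This is a sketch based on the SNTIS weights used in the proof of Theorem~\ref{thm:consistency}; the index set $\mathcal{I}_{\mathrm{low}}$ of low-reward training trajectories and the symbol $S_{\mathrm{low}}$ are notation introduced here for illustration only.
\begin{align*}
S_{\mathrm{low}}(\pi) = \frac{\sum_{i \in \mathcal{I}_{\mathrm{low}}} \min\{\ith{W}, M\}}{\sum_{i=1}^n \min\{\ith{W}, M\}} .
\end{align*}
As a reference point, a policy equal to the behavior policy has $\ith{W}=1$ for all $i$, so $S_{\mathrm{low}}$ equals the fraction of low-reward trajectories, which is over $0.6$ in MIMIC III and over $0.95$ in the tumor simulator. Values far below these fractions indicate that the learned policy places little weight on the low-reward trajectories.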


The Q-learning baselines we consider (BCQ and PQL) do not directly use the importance weights, but they do try to avoid actions or state-action pairs with little support. Our POELA method can be viewed as being similarly inspired, but for non-Markovian settings where policy gradient is beneficial. We also compute the SNTIS weights of the BCQ/PQL policies on the training set in the Markov domain, which satisfies the Markov assumptions of BCQ/PQL. Table~\ref{tab:training_weights_mdp} shows that POELA, BCQ and PQL all give significantly more weight to low-reward trajectories than the alternative policy gradient methods:


\begin{table}[tbh]
    \centering
    \begin{tabular}{c|ccccc}
        \toprule 
         Method & \ispg& \ispgknn & \ispgbaseline & BCQ  & PQL \\
         \midrule 
        Tumor MDP & 0.097 & - (fixed policy) & 0.0004 & 0.083 & 0.124 \\
         \bottomrule
    \end{tabular}
    \vspace{0.1cm}
    \caption{Importance weights overfitting in the Markov tumor domain: sum of SNTIS weights of the learned policy on low-reward trajectories in the training set.}
    \label{tab:training_weights_mdp}
\end{table} 




%\begin{table}[tbh]
%\color{red}
 %   \centering
 %   \begin{tabular}{c|ccccc}
  %      \toprule 
%         Method & \ispg& \ispgknn & \ispgbaseline & BCQ  & PQL \\
 %        \midrule 
  %      MIMIC III & 0.028 & 0.001 & 0.003 & 0.148 & 0.149 \\
   %     Tumor non-MDP & 0.054 & - (fixed policy) & 0.005 & 0.0003 & 0.005 \\
%        Tumor MDP & 0.097 & - (fixed policy) & 0.0004 & 0.083 & 0.124 \\
 %        \bottomrule
 %   \end{tabular}
 %   \vspace{0.1cm}
 %   \caption{\color{red} Importance weights overfitting: sum of SNTIS weights of learned policy on the training set. }
 %   \label{tab:training_weights}
%\end{table} 



These results help illustrate that over-avoidance of low-reward trajectories can be observed for prior policy gradient methods in our datasets. Of course, one challenge is that in real settings, an excellent policy may have low importance weights in avoidable low-reward states and trajectories, but should have higher importance weights in non-avoidable low-reward starting states and trajectories. To get a fuller picture of performance, it is helpful to look at both the weights on trajectories with low rewards and the test evaluation results.  
%We also want to emphasize that the splitting of low-reward/high-reward uses the reward information. Unlike the illustrative example, in this case, we cannot know that if it really learns a better decision or just avoids a particular state which should not be affected by no matter what decisions. In fact, a really good policy will also have very low importance weights in the low-reward samples. These results need to be combined with the test evaluation results to understand the overfitting phenomenon. 
Compared with strong policy gradient baselines, our proposed regularization method has larger importance weights on low-reward trajectories, and the gap between training/validation evaluation and online test performance is also smaller, suggesting that we are less likely to learn policies that erroneously believe they can avoid unavoidable low-reward settings. 

\subsection{The effect of eligible action constraints $\delta$}
\label{sec:effect_delta}
In this section we explore how the choice of $\delta$, which constrains the policy class through impacting the eligible actions, impacts empirical performance. Larger $\delta$ corresponds to a less constrained policy class.  %To show how much the proposed eligible action constraints on actions affect the performance in practice. We shows an ablation study of the training and test SNTIS score given hyper-parameter $\delta$'s value. The higher $\delta$ is, the less constraints we make on actions. 
Other hyperparameters are selected by the same procedure as described in previous sections.

Table~\ref{tab:sepsis_delta_study} shows the results. As $\delta$ increases, the policy search operates with fewer constraints. In this case, our policy gradient method produces a policy with a higher value on the training set, but that policy may not perform as well in the test evaluation and may have a smaller effective sample size than when a smaller $\delta$ is used. 
The best value of the hyperparameter $\delta$ lies in the middle of the explored range. In practice, $\delta$ can be selected based on performance and effective sample size.

\begin{table}[tbh]
    \centering
    \begin{tabular}{l|cccc}
        \toprule 
         $\delta$ & 0.4& 0.6 & 0.8 & 1.0 \\
         \midrule 
        training $\wtis$ & 91.62 & 98.41 & 98.9 & 99.12 \\
        training ESS & 3601.12 & 2242.07 & 1993.08 &1769.46\\
        test $\wtis$ & 86.62 & 90.07 & 91.46 & 90.23  \\
        test ESS  & 1278.08 & 819.64 & 624.92 & 542.53 \\
         \bottomrule
    \end{tabular}
    \vspace{0.1cm}
    \caption{ The effect of eligible action constraints $\delta$ on the results in MIMIC III sepsis dataset. }
    \label{tab:sepsis_delta_study}
\end{table} 
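
To put the ESS values in Table~\ref{tab:sepsis_delta_study} in context, the calculation below relates them to the split sizes. It is a sketch that assumes the table is computed on the $8982$-trajectory training set and the $2995$-trajectory test set described in Appendix~\ref{ap:sepsis-details}, and that ESS follows the common definition $\left(\sum_i w_i\right)^2 / \sum_i w_i^2$ for weights $w_i$; neither assumption is confirmed in this section.
\begin{align*}
\delta = 0.4: &\quad \frac{3601.12}{8982} \approx 40.1\% \ \text{(training)}, \quad \frac{1278.08}{2995} \approx 42.7\% \ \text{(test)}, \\
\delta = 1.0: &\quad \frac{1769.46}{8982} \approx 19.7\% \ \text{(training)}, \quad \frac{542.53}{2995} \approx 18.1\% \ \text{(test)}.
\end{align*}
The gap between training and test $\wtis$ grows from $91.62 - 86.62 = 5.00$ at $\delta = 0.4$ to $99.12 - 90.23 = 8.89$ at $\delta = 1.0$, which is consistent with a less constrained policy class overfitting more.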


\clearpage
\section{Additional experiments: LGG Tumor Growth Inhibition simulator}
\label{ap:more-exp-tumor}
In this section, we provide additional experiments on the LGG Tumor Growth Inhibition simulator.

\subsection{Experiment with estimating the behavior policy with function approximation}
\label{ap:tumor-bc-exp}
\begin{table*}[ht]
    \centering
    \begin{small}
    \tabcolsep=0.1cm % reduce table size by reducing column separation
    \begin{tabular}{c|c|ccccc|c}
        \toprule 
         & Algorithms & \ispg & \ispgmuhat & \ispgbaseline & BCQ & PQL & $9$-mon \\
         \midrule 
         Non-MDP & Test $v^\pi$ & $ 92.34 \pm 1.57 $ & $ 59.62 \pm 12.71 $ & $ 46.66 \pm 14.05$ &  $ 19.36 \pm 5.66 $ &  $30.44 \pm 10.38 $& $68.12 $ \\
         & $\wtis - v^\pi$ & $ 0.94 \pm 1.66 $ & $31.38 \pm 10.97$ & $42.98 \pm 12.87$ &  $72.35 \pm 5.66$&  $62.24 \pm 10.94$ & $-$ \\
        \midrule
        MDP & Test $v^\pi$ & $91.04 \pm 0.55 $ & $78.21 \pm 4.94 $ & $78.70 \pm 0.60 $ &  $99.26 \pm 0.59 $ &  $99.66 \pm 0.29 $ & $68.12$ \\
        & $\wtis - v^\pi$ & $3.40 \pm 2.48 $ & $15.58 \pm 3.92 $ & $15.10 \pm 3.97 $ &  $ -3.88 \pm 1.60 $&  $ -4.09 \pm 1.75 $ & $-$ \\
         \bottomrule
    \end{tabular}
    \end{small}
    \caption{LGG Tumor Growth Inhibition simulator. Test $v^\pi$ and amount of overfitting of the learned policy. Test $v^\pi$ is computed from 1000 rollouts in the simulator. $\wtis$ on the validation set $-$ test $v^\pi$ represents the amount of overfitting. %Drug dosing and treatment effect measured by MTD change is computed from the test rollouts. 
    All numbers are averaged across 5 runs with the standard error reported. Behavior policy $\hat{\mu} = \text{BC}$.}
    \label{tab:tumor_result_bc}
    \vspace{-0.3cm}
\end{table*}


\begin{table*}[ht]
    \centering
    \begin{small}
    \tabcolsep=0.1cm % reduce table size by reducing column separation
    \begin{tabular}{c|c|ccccc|c}
        \toprule 
         & Algorithms & \ispg & \ispgmuhat & \ispgbaseline & BCQ & PQL & $9$-mon \\
         \midrule 
         Non-MDP & Test $v^\pi$ & $ 95.81 \pm 1.68 $ & $76.64 \pm 14.65 $ & $76.43 \pm 14.59 $ &  $ 19.79 \pm 5.76  $ &  $ 31.57 \pm 10.63   $ & $68.12 $ \\
         & $\wtis - v^\pi$ & $-1.52 \pm 1.79 $ & $16.35\pm 14.40 $ & $16.56\pm 14.36 $ &  $ 73.71 \pm 6.34   $ &  $ 62.92  \pm 11.19   $ & $-$ \\
        %  Drug dosing & $5.31 \pm 0.31 $ & $3.81 \pm 1.00$ & $\textbf{0.80} \pm 0.72$ & $2.56 \pm 0.56$ & $9$ \\
        \midrule
        MDP & Test $v^\pi$ & $89.25 \pm 1.51 $ & $75.43 \pm 8.25 $ & $73.61 \pm 0.30 $ &  $ 99.57\pm0.29  $ &  $99.96 \pm 0.12  $ & $68.12$ \\
        & $\wtis - v^\pi$ & $5.17 \pm 2.20 $ & $17.66 \pm 7.90  $ & $19.44 \pm 8.52 $ &  $-4.18 \pm 1.76  $ & $ -4.38 \pm 1.78  $ & $-$ \\
         \bottomrule
    \end{tabular}
    \end{small}
    \caption{LGG Tumor Growth Inhibition simulator. Test $v^\pi$ and amount of overfitting of the learned policy. Test $v^\pi$ is computed from 1000 rollouts in the simulator. $\wtis$ on the validation set $-$ test $v^\pi$ represents the amount of overfitting. %Drug dosing and treatment effect measured by MTD change is computed from the test rollouts. 
    All numbers are averaged across 5 runs with the standard error reported. Behavior policy $\hat{\mu} = \text{BCRNN}$.}
    \label{tab:tumor_result_bcrnn}
    \vspace{-0.3cm}
\end{table*}


\subsection{Alternative selection procedure: checkpoint best intermittent policies}
\label{ap:tumor-checkpoint}
In this section, we use the best-checkpoint policy selection procedure described in Appendix~\ref{appendix:experiment-details}. We report the test performance of the selected policy through online Monte-Carlo estimation.


\begin{table*}[h!]
    \centering
    \begin{small}
    \tabcolsep=0.1cm % reduce table size by reducing column separation
    \begin{tabular}{c|c|ccccc|c}
        \toprule 
         & Algorithms & \ispg & \ispgknn & \ispgbaseline & BCQ & PQL & $9$-mon \\
         \midrule 
         Non-MDP & Test $v^\pi$ & $ 92.20 \pm 1.63 $ & $76.99 \pm 13.80$ & $75.06 \pm 13.22$ & $57.77 \pm 16.71$  & $74.76 \pm 9.75$  & $68.12 $ \\
         & $\wtis - v^\pi$ & $ -1.26 \pm 1.92$ & $16.07 \pm 13.55$ & $15.57 \pm 13.07$ & $37.55 \pm 16.91$  & $17.74 \pm 9.49$ & $-$ \\
        %  Drug dosing & $5.31 \pm 0.31 $ & $3.81 \pm 1.00$ & $\textbf{0.80} \pm 0.72$ & $2.56 \pm 0.56$ & $9$ \\
        \midrule
        MDP & Test $v^\pi$ & $89.52 \pm 1.55$ & $69.18 \pm 10.17$ & $78.79 \pm 6.42$ & $94.7\pm3.49$  & $96.88 \pm 3.76$  & $68.12$ \\
        & $\wtis - v^\pi$ & $ 5.16 \pm 1.78$ & $24.92 \pm 9.71$ & $14.93 \pm 5.71$ & $2.75 \pm 3.41$  & $-0.26 \pm 4.18$  & $-$ \\
         \bottomrule
    \end{tabular}
    \end{small}
    \caption{LGG Tumor Growth Inhibition simulator. Test $v^\pi$ and amount of overfitting of the learned policy. Test $v^\pi$ is computed from 1000 rollouts in the simulator. $\wtis$ on the validation set $-$ test $v^\pi$ represents the amount of overfitting. %Drug dosing and treatment effect measured by MTD change is computed from the test rollouts. 
    All numbers are averaged across 5 runs with the standard error reported. \textbf{Procedure: best intermittent policy checkpoints.}}
    \label{tab:tumor_result_checkpoint}
    \vspace{-0.3cm}
\end{table*}
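
As a reading aid for the overfitting rows, the validation estimate can be recovered from the two reported rows. This assumes that the reported gap is the run-average of the per-run difference $\wtis - v^\pi$, so that the averages add. For the Non-MDP rows of Table~\ref{tab:tumor_result_checkpoint}:
\[
\begin{aligned}
\text{\ispg:} \quad & \wtis = v^\pi + (\wtis - v^\pi) \approx 92.20 + (-1.26) = 90.94, \\
\text{BCQ:} \quad & \wtis = v^\pi + (\wtis - v^\pi) \approx 57.77 + 37.55 = 95.32 .
\end{aligned}
\]
Under this reading, BCQ has the higher validation estimate of the two, yet that estimate overstates its simulator value by roughly $38$ points, whereas the \ispg estimate is slightly pessimistic.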


\begin{table*}[h!]
    \centering
    \begin{small}
    \tabcolsep=0.1cm % reduce table size by reducing column separation
    \begin{tabular}{c|c|ccccc|c}
        \toprule 
         & Algorithms & \ispg & \ispgmuhat & \ispgbaseline & BCQ & PQL & $9$-mon \\
         \midrule 
         Non-MDP & Test $v^\pi$ & $ 94.16 \pm 1.82 $ & $ 74.76 \pm 7.66 $ & $ 76.38 \pm 7.26$ & $92.92 \pm 1.68 $  & $74.65 \pm 14.5 $ & $68.12 $ \\
         & $\wtis - v^\pi$ & $ 0.95 \pm 1.92 $ & $18.02 \pm 7.07$ & $15.02 \pm 6.68$ & $0.58 \pm 0.27$ & $20.49 \pm 14.46$  & $-$ \\
        \midrule
        MDP & Test $v^\pi$ & $91.81 \pm 1.05 $ & $84.86 \pm 3.48 $ & $84.08 \pm 3.46 $ & $86.22 \pm 9.61 $  & $95.02 \pm 4.95 $ & $68.12$ \\
        & $\wtis - v^\pi$ & $2.7 \pm 2.82 $ & $9.23 \pm 3.85 $ & $10.01 \pm 3.88 $ & $11.03 \pm 10.4 $ & $2.45 \pm 5.27 $  & $-$ \\
         \bottomrule
    \end{tabular}
    \end{small}
    \caption{LGG Tumor Growth Inhibition simulator. Test $v^\pi$ and amount of overfitting of the learned policy. Test $v^\pi$ is computed from 1000 rollouts in the simulator. $\wtis$ on the validation set $-$ test $v^\pi$ represents the amount of overfitting. %Drug dosing and treatment effect measured by MTD change is computed from the test rollouts. 
    All numbers are averaged across 5 runs with the standard error reported. Behavior policy $\hat{\mu} = \text{BC}$. \textbf{Procedure: best intermittent policy checkpoints.}}
    \label{tab:tumor_result_bc_checkpoint}
    \vspace{-0.3cm}
\end{table*}


\begin{table*}[h!]
    \centering
    \begin{small}
    \tabcolsep=0.1cm % reduce table size by reducing column separation
    \begin{tabular}{c|c|ccccc|c}
        \toprule 
         & Algorithms & \ispg & \ispgmuhat & \ispgbaseline & BCQ & PQL & $9$-mon \\
         \midrule 
         Non-MDP & Test $v^\pi$ & $ 96.34 \pm 1.58 $ & $77.51 \pm 13.87 $ & $75.73 \pm 14.3 $ & $92.73 \pm 1.67 $  & $74.94 \pm 14.47 $  & $68.12 $ \\
         & $\wtis - v^\pi$ & $-2.05 \pm 1.9 $ & $15.48\pm 13.62 $ & $17.27\pm 14.02 $ & $0.77\pm 0.52 $  & $20.2\pm 14.43 $  & $-$ \\
        %  Drug dosing & $5.31 \pm 0.31 $ & $3.81 \pm 1.00$ & $\textbf{0.80} \pm 0.72$ & $2.56 \pm 0.56$ & $9$ \\
        \midrule
        MDP & Test $v^\pi$ & $90.06 \pm 1.65 $ & $79.62 \pm 7.82 $ & $79.54 \pm 7.65 $ & $86.38 \pm 9.47 $  & $95.16 \pm 4.9 $  & $68.12$ \\
        & $\wtis - v^\pi$ & $4.46 \pm 2.31 $ & $13.81 \pm 6.96  $ & $13.89 \pm 6.81 $ & $10.87 \pm 10.24 $  & $2.33 \pm 5.19 $  & $-$ \\
         \bottomrule
    \end{tabular}
    \end{small}
    \caption{LGG Tumor Growth Inhibition simulator. Test $v^\pi$ and amount of overfitting of the learned policy. Test $v^\pi$ is computed from 1000 rollouts in the simulator. $\wtis$ on the validation set $-$ test $v^\pi$ represents the amount of overfitting. %Drug dosing and treatment effect measured by MTD change is computed from the test rollouts. 
    All numbers are averaged across 5 runs with the standard error reported. Behavior policy $\hat{\mu} = \text{BCRNN}$. \textbf{Procedure: best intermittent policy checkpoints.}}
    \label{tab:tumor_result_bcrnn_checkpoint}
    \vspace{-0.3cm}
\end{table*}

\section{Additional experiments: MIMIC III sepsis}
\label{ap:more-exp-sepsis}

In this section, we provide additional results that complement the MIMIC III sepsis experiments in the main text.


\subsection{Alternative selection procedure: checkpoint best intermittent policies}
\label{ap:sepsis-checkpoint}
In this section, we use checkpoints to select the best policy during training, following the procedure described in Appendix~\ref{appendix:experiment-details}. We report the test performance of the selected policy using SNTIS estimates on a held-out test set.

\begin{table*}[h!]
    \centering
    \begin{small}
    \begin{tabular}{c|ccccc|c}
        \toprule 
         Method & \ispg& \ispgmuhat & \ispgbaseline & BCQ  & PQL & Clinician \\
         \midrule 
        Test SNTIS & 91.46 (90.82) & 87.95 & 87.71 & 82.67 & 84.40 & 81.10 \\
        $95\%$ BCa UB & 93.24 (92.61) &	90.58 &		90.04 & 86.83 &	88.29 & 82.19 \\
        $95\%$ BCa LB & 89.59 (88.68) & 84.77 &	84.90 & 78.25	& 80.13 & 79.80\\
        Test ESS & 624.92 (586.37) & 372.00 & 399.59 & 228.82 & 231.93 & 2995 \\
         \bottomrule
    \end{tabular}
    \end{small}
    \caption{MIMIC III sepsis dataset. Test evaluation, $(0.05, 0.95)$ BCa bootstrap interval, and effective sample size. The value of \ispg without a CRM variance penalty is shown in parentheses. \textbf{Procedure: best intermittent policy checkpoints.}}
    \label{tab:sepsis_result_checkpoint}
    \vspace{-0.3cm}
\end{table*}

\begin{table*}[h!]
    \centering
    \begin{small}
    \begin{tabular}{c|ccccc|c}
        \toprule 
         Method & \ispg& \ispgmuhat & \ispgbaseline & BCQ  & PQL & Clinician \\
         \midrule 
        Test SNTIS & $85.01$ ($89.62$) & $84.70$ &	$85.53$ & $83.17$& $84.16$ & 81.10 \\
        $95\%$ BCa UB & $88.61$ ($92.75$) & $88.56$ & $87.80$ & $92.88$	& $88.04$ & 82.19 \\
        $95\%$ BCa LB & $80.55$ ($85.57$) & $80.15$ & $83.23$ & $63.98$	& $79.98$ & 79.80\\
        Test ESS & $227.92$ ($214.12$) & $228.97$ &	$354.86$ & $208.92$	& $209.72$ & 2995 \\
         \bottomrule
    \end{tabular}
    \end{small}
    \caption{MIMIC III sepsis dataset. Test evaluation, $(0.05, 0.95)$ BCa bootstrap interval, and effective sample size. The value of \ispg without a CRM variance penalty is shown in parentheses. Behavior policy $\hat{\mu} = \text{BC}$. \textbf{Procedure: best intermittent policy checkpoints.}}
    \label{tab:sepsis_result_bc_checkpoint}
    \vspace{-0.3cm}
\end{table*}


\begin{table*}[h!]
    \centering
    \begin{small}
    \begin{tabular}{c|ccccc|c}
        \toprule 
         Method & \ispg& \ispgmuhat & \ispgbaseline & BCQ  & PQL & Clinician \\
         \midrule 
        Test SNTIS & $88.34$ ($90.89$) & $87.98$ & $85.12$	& $83.20$	& $85.06$ & 81.10 \\
        $95\%$ BCa UB & $91.65$ ($93.78$) & $91.06$ & $92.75$ & $91.56$	& $89.12$ & 82.19 \\
        $95\%$ BCa LB & $83.94$ ($87.05$) & $84.41$ & $72.96$ & $66.27$	& $79.76$ & 79.80\\
        Test ESS & $201.49$ ($220.86$) & $285.82$ & $211.20$ & $206.11$ & $212.36$ & 2995 \\
         \bottomrule
    \end{tabular}
    \end{small}
    \caption{MIMIC III sepsis dataset. Test evaluation, $(0.05, 0.95)$ BCa bootstrap interval, and effective sample size. The value of \ispg without a CRM variance penalty is shown in parentheses. Behavior policy $\hat{\mu} = \text{BCRNN}$. \textbf{Procedure: best intermittent policy checkpoints.}}
    \label{tab:sepsis_result_bcrnn_checkpoint}
    \vspace{-0.3cm}
\end{table*}


\subsection{Ablation study: ESS constraints for hyperparameter selection on validation set}
\label{ap:sepsis-ablation}
In the main text, we require an effective sample size of at least 200 for a policy/hyperparameter setting to be selected on the validation set. This ensures that the effective sample size on the test set is large enough to provide reliable off-policy test estimates. Table~\ref{tab:sepsis_noess_result} shows the results when we do not threshold the effective sample size on the validation set. Generally, all algorithms prefer policies with high off-policy estimates but insufficient effective sample size. On the test set, all algorithms yield a small effective sample size, and thus unreliable off-policy estimates and wide bootstrap confidence intervals. The proposed method is better than the baselines, but also has a much smaller $95\%$ bootstrap lower bound than with the effective sample size constraint.
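
For reference, the role of this threshold can be made concrete under the common (Kish) definition of effective sample size. This is a sketch under that assumption, with $w_i$ denoting the importance weight of the $i$-th trajectory and $n$ the number of trajectories (notation introduced here for illustration):
\[
\mathrm{ESS} = \frac{\left(\sum_{i=1}^{n} w_i\right)^2}{\sum_{i=1}^{n} w_i^2}, \qquad 1 \le \mathrm{ESS} \le n .
\]
If all weights are equal, $\mathrm{ESS} = n$; if a single weight dominates, $\mathrm{ESS} \approx 1$. For example, weights $(1, 1, 1, 1)$ give $\mathrm{ESS} = 16/4 = 4$, whereas weights $(4, 0, 0, 0)$ give $\mathrm{ESS} = 16/16 = 1$. Under this reading, a test ESS close to $1$, such as the $1.27$ reported for PQL in Table~\ref{tab:sepsis_noess_result}, means that the estimate is driven by essentially one trajectory.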

\begin{table}[ht]
    \centering
    \begin{tabular}{c|ccccc}
        \toprule 
         Method & \ispg& \ispgknn & \ispgbaseline & BCQ  & PQL \\
         \midrule 
        Test SNTIS & 87.63 (86.29) & 82.36 & 82.36 & 83.28 & 96.32 \\
        $95\%$ BCa LB & 85.06 (83.51) & 64.92 & 63.48 & 56.65 & 57.25 \\
        $95\%$ BCa UB & 90.00 (88.59) & 94.22 & 93.62 & 100 & 100 \\
        Test ESS & 528.18 (491.71) & 21.23 & 21.23 & 9.04 & 1.27 \\
         \bottomrule
    \end{tabular}
    \vspace{0.1cm}
    \caption{MIMIC III sepsis dataset. Test evaluation without an effective sample size constraint on the validation set, $(0.05, 0.95)$ BCa bootstrap interval, and effective sample size. The value of \ispg without a CRM variance penalty is shown in parentheses.}
    \label{tab:sepsis_noess_result}
    \vspace{-0.5cm}
\end{table}

% \newpage
\subsection{The trade-off between ESS and performance estimates}
\label{ap:sepsis-tradeoff}
A tension in conservative offline optimization is that the most reliable and conservative policy estimates come from effectively imitating the behavior policy (which maximizes ESS). Policies that differ substantially from the behavior policy may yield higher performance, but have less overlap with the logged data (and thus lower ESS). This is illustrated in Figure~\ref{fig:sepsis_tradeoff}, where the value estimates are plotted for each hyperparameter setting and restart of the different algorithms. We observe that \ispg achieves a better Pareto frontier between performance estimates and ESS than the other algorithms. Note that this experiment uses the policy selection procedure in which the best policy is selected during training based on SNTIS estimates on the validation set (cf. Table~\ref{tab:sepsis_result_checkpoint}).
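
The notion of a Pareto frontier used here can be stated as follows; the symbols $\hat{v}(\pi)$ (value estimate) and $\mathrm{ESS}(\pi)$ are notation introduced for this illustration. A policy $\pi$ dominates a policy $\pi'$ if
\[
\hat{v}(\pi) \ge \hat{v}(\pi') \quad \text{and} \quad \mathrm{ESS}(\pi) \ge \mathrm{ESS}(\pi'),
\]
with at least one inequality strict. The Pareto frontier of a set of policies consists of those that are not dominated by any other policy in the set. For instance, among three hypothetical points $(\hat{v}, \mathrm{ESS}) = (90, 300)$, $(85, 500)$ and $(84, 250)$, the first two lie on the frontier, while the third is dominated by both.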


\begin{figure*}[ht]%{R}{0.8\textwidth}
    \centering
    \includegraphics[width=0.6\textwidth]{images/mimic_ess_tradeoff.png}
    \caption{Trade-off between ESS and value estimates.}
    \label{fig:sepsis_tradeoff}
\end{figure*}

\subsection{Eligible actions visualization for high/mid/low-SOFA patients}
\label{ap:sepsis-action-viz}
In this section, we explore the learned policies for patients with high logged SOFA scores (measuring organ failure) in the test dataset. Figure~\ref{fig:sepsis_highsofa_visualization} shows how often each action is taken by the different policies and by the clinicians. \ispg mainly selects treatments similar to the clinicians' but more concentrated on high-vasopressor treatments, while \ispgbaseline and the value-based methods select treatments that differ from the logged clinician decisions, suggesting that these policies may be overfitting to avoid contexts with high SOFA. However, some patients arrive with high SOFA scores, and a policy must have suitable treatments to support such individuals, which our method appears to ensure. For completeness, we also show the visualization of mid-SOFA ($5$--$15$) and low-SOFA ($<5$) patient contexts in Figures~\ref{fig:sepsis_midsofa_visualization} and~\ref{fig:sepsis_lowsofa_visualization}.

\begin{figure}[h!]%{R}{0.8\textwidth}
    \begin{minipage}{\textwidth}
    \centering
    \includegraphics[width=0.8\textwidth]{images/highsofa_main.png}
    \subcaption{Action counts in high-SOFA contexts}
    \label{fig:sepsis_highsofa_visualization}
    \end{minipage} \\
    \begin{minipage}{\textwidth}
    \centering
    \includegraphics[width=0.8\textwidth]{images/midsofa_main.png}
    \subcaption{Action counts in mid-SOFA contexts}
    \label{fig:sepsis_midsofa_visualization}
    \end{minipage} \\
    \begin{minipage}{\textwidth}
    \centering
    \includegraphics[width=0.8\textwidth]{images/lowsofa_main.png}
    \subcaption{Action counts in low-SOFA contexts}
    \label{fig:sepsis_lowsofa_visualization}
    \end{minipage} 
    \caption{Action count heatmaps of the policies learned by the different algorithms in (a) high-SOFA, (b) mid-SOFA, and (c) low-SOFA contexts.}
\end{figure}

\clearpage
\section{Additional experiments: OpenAI Gym environment CartPole}
\label{ap:more-exp-cartpole}

In this experiment, we collect a dataset by training DQN~\citep{mnih2013playing} on the task and saving trajectories with a horizon of 200 steps at regular checkpoints during training. The dataset is a mixture of sub-optimal and expert data totalling 20000 transitions. For the non-Markov modification, we keep the \textit{Cart Position}, \textit{Cart Velocity} and \textit{Pole Angle} observations but remove the \textit{Pole Angular Velocity} element. In Table~\ref{tab:hp_cartpole}, we report the hyperparameters used in the experiments.
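
Concretely, the standard CartPole observation has four components, and the non-Markov variant described above drops the last one. Writing $x$ for the cart position and $\theta$ for the pole angle (symbols introduced here for illustration):
\[
o_t = (x_t, \dot{x}_t, \theta_t, \dot{\theta}_t) \;\longmapsto\; \tilde{o}_t = (x_t, \dot{x}_t, \theta_t).
\]
Since $\dot{\theta}_t$ is part of the simulator state, two states with the same $\tilde{o}_t$ but different $\dot{\theta}_t$ evolve differently, so the next observation is no longer determined by the current observation and action alone.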

\begin{table}[ht]
    \centering
    \begin{tabular}{c|cc}
    \toprule
       Hyperparameter & Used in algorithms & Values \\
       \midrule
       $\delta$ & \ispg & $0.0001, 0.0005, 0.001, 0.005, 0.01$ \\
       $\hat{\mu}$ threshold & \ispgknn & 0.05, 0.1, 0.15, 0.2 \\
       CRM Var coefficient & \ispg, \ispgbaseline & $0, 0.1, 1, 10$ \\
       BCQ threshold & BCQ, PQL & $0.0, 0.05, 0.1, 0.2, 0.5$ \\ 
       \midrule
       $M$ in $\wtis$ & \text{All} & $1000$ \\
       Max training steps & \ispg, \ispgbaseline, \ispgknn & 500 \\
       & BCQ, PQL & 1000 \\
       Number of checkpoints & All & 50  \\
       Batch size & BCQ, PQL & 64 \\
       \bottomrule
    \end{tabular}
    \vspace{0.5cm}
    \caption{Hyperparameters in the CartPole experiment.}
    \label{tab:hp_cartpole}
\end{table}
 
 
\subsection{Standard evaluation procedure: use policy at the end of training}

\begin{table*}[ht]
    \centering
    \begin{small}
    \begin{tabular}{c|ccccc|c}
        \toprule 
         Method & \ispg& \ispgmuhat & \ispgbaseline & BCQ  & PQL & Behavior policy\\
         \midrule 
        Test SNTIS & 88.29 (86.62) & 78.79 & 72.63 & 21.28 & 23.61 & 41.41\\
        $95\%$ BCa UB & 89.70 (89.81) & 83.87 & 76.77 & 24.63 & 27.14 & 45.04\\
        $95\%$ BCa LB & 85.93 (85.57) & 69.64 & 68.15 & 16.22 & 20.36 & 38.16\\
        Test ESS & 43.32 (40.78) & 30.51 & 30.13 & 30.11 & 30.08 & 248\\
         \bottomrule
    \end{tabular}
    \end{small}
    \caption{CartPole dataset. Test evaluation, $(0.05, 0.95)$ BCa bootstrap interval, and ESS. The value of \ispg without a CRM variance penalty is shown in parentheses.}
    \label{tab:cartpole_results}
    \vspace{-0.3cm}
\end{table*}
\begin{table*}[ht]
    \centering
    \begin{small}
    \begin{tabular}{c|ccccc|c}
        \toprule 
         Method & \ispg& \ispgmuhat & \ispgbaseline & BCQ  & PQL & Behavior policy\\
         \midrule 
        Test SNTIS & 76.18 (72.21) &68.39 & 67.14 & 12.13 & 5.46 & 41.41\\
        $95\%$ BCa UB & 89.27 (88.32) & 80.22 & 83.72 & 12.89 & 6.63& 45.04\\
        $95\%$ BCa LB & 68.97  (67.49) & 57.13 & 57.78 & 9.17 & 5.02 & 38.16\\
        Test ESS & 36.41 (34.72) & 34.56 & 31.87 & 31.22 & 30.07 & 248\\
         \bottomrule
    \end{tabular}
    \end{small}
    \caption{Non-MDP CartPole dataset. Test evaluation, $(0.05, 0.95)$ BCa bootstrap interval, and ESS. The value of \ispg without a CRM variance penalty is shown in parentheses.}
    \label{tab:cartpole_pomdp_results}
    \vspace{-0.3cm}
\end{table*}

\subsection{Alternative selection procedure: checkpoint best intermittent policies}


\begin{table*}[ht]
    \centering
    \begin{small}
    \begin{tabular}{c|ccccc|c}
        \toprule 
         Method & \ispg& \ispgmuhat & \ispgbaseline & BCQ  & PQL & Behavior policy\\
         \midrule 
        Test SNTIS &  88.43 (87.56) & 76.01 & 82.25 & 17.74 & 17.83 & 41.41\\
        $95\%$ BCa UB & 90.46 (90.72) & 82.87 & 86.18 & 21.80 & 21.84 & 45.04\\
        $95\%$ BCa LB & 85.48 (84.63) & 66.21 & 74.30 & 12.84 & 13.26 & 38.16\\
        Test ESS & 43.32 (39.66) & 31.04 & 30.87 & 30.29 & 30.18 & 248\\
         \bottomrule
    \end{tabular}
    \end{small}
    \caption{CartPole dataset. Test evaluation, $(0.05, 0.95)$ BCa bootstrap interval, and ESS. The value of \ispg without a CRM variance penalty is shown in parentheses. \textbf{Procedure: best intermittent policy checkpoints.}}
    \label{tab:cartpole_results_checkpoint}
    \vspace{-0.3cm}
\end{table*}
\begin{table*}[ht]
    \centering
    \begin{small}
    \begin{tabular}{c|ccccc|c}
        \toprule 
         Method & \ispg& \ispgmuhat & \ispgbaseline & BCQ  & PQL & Behavior policy\\
         \midrule 
        Test SNTIS & 75.76 (75.70) &68.66 & 66.34 & 11.73 & 5.70 & 41.41\\
        $95\%$ BCa UB & 92.35 (89.16) & 79.56 & 82.46 & 12.49 & 6.71& 45.04\\
        $95\%$ BCa LB & 68.34  (66.08) & 55.49 & 57.50 & 7.98 & 5.08 & 38.16\\
        Test ESS & 37.72 (35.27) & 35.15 & 36.02 & 30.12 & 31.77 & 248\\
         \bottomrule
    \end{tabular}
    \end{small}
    \caption{Non-MDP CartPole dataset. Test evaluation, $(0.05, 0.95)$ BCa bootstrap interval, and ESS. The value of \ispg without a CRM variance penalty is shown in parentheses. \textbf{Procedure: best intermittent policy checkpoints.}}
    \label{tab:cartpole_pomdp_results_checkpoint}
    \vspace{-0.3cm}
\end{table*}

% \clearpage
\section{Additional experiments: D4RL}
\label{ap:more-exp-d4rl}
Although our primary focus is on application areas where the Markov assumption may be incorrect or unverifiable, we also compare on an additional standard benchmark, namely D4RL.

POELA must be adapted to work with continuous action spaces. In practice, instead of using the eligible action set $A_h$, we pre-compute a set of similar actions for each data sample, and use the distance to the closest state $x_h$ associated with the most similar action distributions in the dataset as a smooth penalty in Line 5 of Algorithm~\ref{alg:is_policy_optimization}.

For each dataset quality (random, medium, and expert) and task (Hopper and Walker2D), we report performance scaled from 0 to 100 (0 corresponds to the average return of a random policy and 100 to that of an expert policy), following the experimental protocol for D4RL with 200 episodes in each dataset. We compare with state-of-the-art methods on this benchmark. The results are reported in Table~\ref{tab:d4rl_results}.
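
The scaling described above matches the usual D4RL normalized score. As a sketch, with $J(\pi)$ the average return of the evaluated policy and $J_{\text{random}}$, $J_{\text{expert}}$ the reference returns (symbols introduced here for illustration):
\[
\text{score}(\pi) = 100 \times \frac{J(\pi) - J_{\text{random}}}{J_{\text{expert}} - J_{\text{random}}}.
\]
For example, with hypothetical returns $J_{\text{random}} = 0$, $J_{\text{expert}} = 3000$ and $J(\pi) = 1500$, the score is $50$. In this form the score is not clipped, so it can fall below $0$ or exceed $100$.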

\begin{table*}[ht]
    \centering
    \begin{small}
    \begin{tabular}{c|ccc|c}
        \toprule 
         Dataset & \ispg & BCQ  & CQL & Behavior policy\\
         \midrule 
        Hopper-random & 10.5 & 10.5 & 10.8 & 9.8\\
        Hopper-medium & 43.7 & 42.9 & 41.4 & 29.0\\
        Hopper-expert & 58.9 & 59.7 & 52.6 & 43.6\\
        Walker2D-random & 6.1 & 4.6 & 5.4 & 1.6\\
        Walker2D-medium & 33.8 & 31.1 & 49.6 & 6.6\\
        Walker2D-expert & 32.2 & 32.8 & 54.7 & 50.2\\
         \bottomrule
    \end{tabular}
    \end{small}
    \caption{Additional experiments on 6 D4RL datasets.}
    \label{tab:d4rl_results}
    \vspace{-0.3cm}
\end{table*}

The results in Table~\ref{tab:d4rl_results} suggest that POELA performs similarly to two other state-of-the-art methods in this setting, even though POELA does not rely on the Markov assumptions that BCQ and CQL make and leverage.