\documentclass[accepted]{uai2022}

\usepackage[american]{babel}
\usepackage{natbib}
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools}
\usepackage{amssymb}
\usepackage{siunitx}
\usepackage{graphicx}
\usepackage{float}
\usepackage{booktabs}

\DeclareMathOperator*{\argmax}{arg\,max}
\def\IS{\mathrm{IS}}
\def\WIS{\mathrm{WIS}}
\def\cA{\mathcal{A}}
\def\cD{\mathcal{D}}
\def\cH{\mathcal{H}}
\def\cX{\mathcal{X}}
\def\cZ{\mathcal{Z}}
\def\th{\tilde{h}}
\def\tH{\tilde{H}}
\def\tz{\tilde{z}}
\def\tZ{\tilde{Z}}
\def\E{\mathbb{E}}
\def\bbR{\mathbb{R}}
\def\hp{\hat{p}}

% Cross references
\newcommand{\appReg}{A}
\newcommand{\appProtValue}{A.1}
\newcommand{\appIgnorability}{A.2}
\newcommand{\appExpDetails}{B}
\newcommand{\appSepsisSim}{B.2}

\title{Case-Based Off-Policy Evaluation Using Prototype Learning}

\author[1]{Anton~Matsson}
\author[1]{Fredrik~D.~Johansson}

\affil[1]{Chalmers University of Technology}

\begin{document}

\maketitle

% ----------------------------------------------------------
% -- ABSTRACT ----------------------------------------------
% ----------------------------------------------------------

\begin{abstract}
    Importance sampling (IS) is often used to perform off-policy evaluation but it is prone to several issues---especially when the behavior policy is unknown and must be estimated from data. Significant differences between target and behavior policies can result in uncertain value estimates due to, for example, high variance. Standard practices such as inspecting IS weights may be insufficient to diagnose such problems and determine for which type of inputs the policies differ in suggested actions and resulting values. To address this, we propose estimating the behavior policy for IS using prototype learning. The learned prototypes provide a condensed summary of the input-action space, which allows for describing differences between policies and assessing the support for evaluating a certain target policy. In addition, we can describe a value estimate in terms of prototypes to understand which parts of the target policy have the most impact on the estimate. We find that this provides new insights in the examination of a learned policy for sepsis management. Moreover, we study the bias resulting from restricting models to use prototypes, how bias propagates to IS weights and estimated values and how this varies with history length.
\end{abstract}

% ----------------------------------------------------------
% -- INTRODUCTION ------------------------------------------
% ----------------------------------------------------------

\section{Introduction}
\label{sec:intro}

Historical data on decisions and outcomes provide opportunities for evaluating policies for future decision-making. For example, the prospect of using patient records to evaluate new policies for medication dosing in sepsis management has attracted recent attention~\citep{komorowski2018artificial,gottesman2019guidelines}. This is an example of off-policy evaluation (OPE), which amounts to estimating the value of a target policy based on data gathered under a different so-called behavior policy; see, e.g.,~\citet{thomas2015safe} for an overview.

Importance sampling (IS) methods~\citep{precup2000eligibility} perform OPE by weighting observed outcomes by the density ratio of the target policy and the behavior policy. IS methods are often preferred over alternatives that rely on modeling outcomes or covariate transitions, due to their simplicity and the fact that behavior policies are often controllable or human-made. Similarly, the equivalent strategy of inverse-propensity weighting is fundamental to the study of causal effects~\citep{rosenbaum1983central,hirano2003efficient}.

In practice, it is difficult to assess the quality of an IS value estimate. When the behavior policy is unknown and must be estimated from data, conditions which guarantee good estimates are hard to meet and rely on untestable assumptions~\citep{rosenbaum2010design,namkoong2020off}. Standard practices of inspecting weights~\citep{li2019addressing} and removing outliers~\citep{crump2009dealing} give only aggregate or per-sample perspectives on potential issues and are often insufficient for domain experts to reason about the validity of the result. There is a clear need to better inspect and diagnose importance sampling estimates.

In this paper, we propose estimating the unknown behavior policy using prototype learning~\citep{li2018deep,ming2019prototypes}. The learned prototypes are selected cases from the input data, readily interpretable by a domain expert and representative of the input-action space. In healthcare applications, the prototypes are trajectories of former patients, and a prototype-based estimate of the behavior policy is analogous to how physicians use experience from previous patients to treat new ones. While offering transparency, a prototype model is flexible enough to model behavior policies in large and/or sequential input spaces.

Our main contribution is to use learned prototypes as an OPE diagnostic tool. In addition to enabling interpretation of individual predictions, we show that (a) prototypes can be used to describe areas of similarities/dissimilarities between behavior and target policies; and (b) prototypes induce a soft clustering which can be used to explain differences in value for different policies. We elaborate on this idea in Section \ref{sec:protovalues} and demonstrate our method in Section \ref{sec:sepsisexp} using an example of sepsis management. Further, we study the added bias of restricting the model class to use prototypes and how this bias propagates to the IS weights in Section \ref{sec:analysis}. 

% ----------------------------------------------------------
% -- OFF-POLICY EVALUATION ---------------------------------
% ----------------------------------------------------------

\section{Off-Policy Evaluation}
\label{sec:ope_def}

Policy evaluation refers to estimating the \emph{value} $V(\pi)$ of a \emph{target policy} $\pi \in \Pi$, as defined below. We focus on the sequential case, where a policy is used to select an \emph{action} $A\in\cA=\{1, \ldots, k\}$ after a \emph{history} $H \in \cH$, comprising a sequence of previous actions and \emph{contexts} $X\in\cX$. The history until time $t$ is defined as $H_{t}\coloneqq(X_{0}, A_{0}, X_{1}, A_{1}, \ldots, X_{t})$, with $H_{0}=X_{0}$. A policy $\pi : \cH \rightarrow \Delta_\cA$ is a map from a history to a distribution over $\cA$. In a medical example, a context $X$ could correspond to information about a patient's state, an action $A$ to a medical intervention, and the target policy $\pi$ to new clinical guidelines.

The value of a policy $\pi$ is defined as the expectation of a \emph{reward} or \emph{outcome} $R\in \bbR$, accumulated after acting according to $\pi$. Here, we study the special case where a single reward is awarded at the end of the sequence, $R = R_T$, but our results generalize to the case where rewards are given after every action. Under the distribution $p_\pi(X_0, A_0, \ldots, X_T, A_T, R) = p_\pi(H_T, A_T, R)$, induced by the policy $\pi$, the value is $V(\pi) \coloneqq \E_\pi[R]$.

Estimating $V(\pi)$ is trivial given a large enough number of samples from the distribution $p_\pi$ induced by the target policy. In off-policy evaluation (OPE), we have access to no such samples, but must estimate $V(\pi)$ using an observational dataset of $m$ samples $\cD = ((h^1_{t_1}, a^1_{t_1}, r^1), \ldots, (h^m_{t_m}, a^m_{t_m}, r^m))$, drawn according to a distribution $p_{\mu}(H_T,A_T,R)$, controlled by a \emph{behavior policy} $\mu \in \Pi$. In the medical example, the behavior policy represents current clinical practice. In this work, the behavior policy $\mu$ is unknown and an estimate $\hat{p}_{\mu}(A\mid H)$ is learned from the samples $\cD$.

A common method for OPE is \emph{importance sampling} (IS).\footnote{Importance sampling estimators are often also referred to as ``importance weighting'' estimators.} The IS estimator uses an estimate $\hp_\mu$ in a weighted average over the samples $\cD$~\citep{hanna2019importance}:
\begin{equation}
    \hat{V}_\IS(\pi; \hat{\mu}) 
    \coloneqq 
    \frac{1}{m}\sum_{i=1}^{m}w_{i}r^{i},
    \label{eq:VIS}
\end{equation}
with
\begin{equation}
    w_{i}
    \coloneqq
    \prod_{t=0}^{t_{i}}
    \frac{p_{\pi}(A_{t}=a_{t}^{i} \mid H_{t}=h_{t}^{i})}
    {\hat{p}_{\mu}(A_{t}=a_{t}^{i} \mid H_{t}=h_{t}^{i})}.
    \label{eq:wr_redef}
\end{equation}
Sufficient conditions for the estimator $\hat{V}_\IS(\pi;\hat{\mu})$ to be an unbiased estimator of $V(\pi)$ include (sequential) \emph{ignorability} and \emph{overlap}~\citep{rosenbaum1983central,robins1986new}. In our setting, ignorability may be defined as follows: for all $t$, the conditional distribution of $R$ given $A_t$ and $H_{t}$ is the same under $\pi$ and $\mu$, i.e., $\forall t : p_{\pi}(R \mid H_{t}, A_{t}) = p_{\mu}(R \mid H_{t}, A_{t})$. Overlap is satisfied for a pair $(h, a)$ if its being observable under $\pi$ implies that it is observable under $\mu$, i.e., $p_{\pi}(A_t=a \mid H_t=h) > 0 \Rightarrow p_{\mu}(A_t=a \mid H_t=h) > 0$. We say that overlap is partially violated if this condition is violated for some pairs of histories and actions.

Even when ignorability and overlap are satisfied, if $\mu$ and $\pi$ differ significantly, the estimator $\hat{V}_\IS(\pi;\hat{\mu})$ suffers from high variance. The weighted importance sampling (WIS) estimator~\citep{rubinstein2016simulation}, 
$\hat{V}_\WIS(\pi; \hat{\mu}) \coloneqq \frac{1}{\sum_{i=1}^{m}w_{i}}\sum_{i=1}^{m}w_{i}r^{i}$,
introduces bias, but often has less variance. Under the Markov assumption, i.e., that context (or ``state'') transitions, actions and rewards depend only on the most recent context-action pair, the history $H_{t}$ in \eqref{eq:wr_redef} can be replaced by $X_{t}$. We leave out the subscript $t$ where clear.
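
As a quick sanity check on these definitions (a worked special case, not a new result), suppose the target policy coincides with the estimated behavior policy, so that every ratio in \eqref{eq:wr_redef} equals one and $w_i = 1$ for all $i$. Both estimators then reduce to the sample mean of the observed rewards,
\begin{equation*}
    \hat{V}_\IS(\pi; \hat{\mu}) = \hat{V}_\WIS(\pi; \hat{\mu}) = \frac{1}{m}\sum_{i=1}^{m} r^{i}.
\end{equation*}
More generally, multiplying all weights by a common constant $\alpha > 0$ (a symbol introduced only for this illustration) scales $\hat{V}_\IS$ by $\alpha$ but leaves $\hat{V}_\WIS$ unchanged, since the constant cancels in the ratio.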

% ----------------------------------------------------------
% -- TRUSTING IS -------------------------------------------
% ----------------------------------------------------------

\subsection{Can We Trust an IS Estimate?}
\label{sec:is_problems}

A fundamental challenge with off-policy evaluation is that no ground truth value, or even samples of it, is available. What is  worse, the assumption of ignorability cannot be verified statistically~\citep{rosenbaum2010design} and the extent of overlap is unknown if $\mu$ is unknown. As a result, assessing the quality of an estimate $\hat{V}_\IS$ inherently relies on domain expertise. 

By examining importance weights $\{w_i\}_{i=1}^m$ and estimated propensities $\hat{p}_\mu(A_t \mid H_t)$, analysts can spot outliers with extremely large weights, and compute the effective sample size (ESS)~\citep[Chapter 9]{gottesman2018evaluating,mcbook}. These practices give a per-sample and an average view of potential issues with variance and the potential for removing samples with excessive weights~\citep{crump2009dealing,sturmer2010treatment}. However, several questions remain regarding what replacing $\mu$ with $\pi$ would imply in practice:
\begin{itemize}
    \item \textbf{Where do $\pi$ and $\mu$ differ?} How can we \textit{describe} the inputs for which the most probable actions under $\pi$ differ from those under $\mu$?
    \item \textbf{If $\hat{V}(\pi) > \hat{V}(\mu)$, what gives $\pi$ the edge?} In \textit{which} situations does acting according to $\pi$ result in higher rewards than acting according to $\mu$?
\end{itemize}

Inspecting weights and propensities in aggregate or on a per-sample basis is insufficient to answer these questions as they concern \emph{patterns} in policy decisions, weights and rewards. For example, extreme weights and small ESS may indicate lack of overlap between $\pi$ and $\mu$ but do not explain the cause of the problem. Next, we show that a case-based model of the behavior policy $\mu$ can help identify said patterns by inducing a soft clustering over the space of histories.
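
For concreteness, one widely used definition of the ESS is given below. We state it as an assumption, since several variants exist and the text above does not fix one:
\begin{equation*}
    \mathrm{ESS} \coloneqq \frac{\big(\sum_{i=1}^{m} w_{i}\big)^{2}}{\sum_{i=1}^{m} w_{i}^{2}}.
\end{equation*}
With equal weights, $\mathrm{ESS} = m$; if a single weight dominates, $\mathrm{ESS} \approx 1$. For example, the hypothetical weights $(8, 1, 1)$ give $\mathrm{ESS} = 10^{2}/66 \approx 1.5$, so three samples carry roughly the information of one and a half. The number summarizes how concentrated the weights are, but not which kinds of histories carry them.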

% ----------------------------------------------------------
% -- OPE WITH PROTOTYPES -----------------------------------
% ----------------------------------------------------------

\section{Off-Policy Evaluation with Prototypes}
\label{sec:is_proto}

We propose performing off-policy evaluation using \emph{prototype learning}~\citep{li2018deep,ming2019prototypes}. The idea is to express the behavior policy $p_{\mu}(A \mid H)$ by comparing the history $H$ to a relatively small set of prototype histories from the training data; see Figure \ref{fig:model}. In a clinical setting, such a policy may correspond to physicians choosing treatment for a new patient based on their prior experience in treating similar patients. For a domain expert, trained in interpreting such cases, a prototype-based estimate is transparent as long as the number of prototypes is small enough.
By examining how policy overlap and value estimates vary with prototypes, we can answer the questions raised above.

\begin{figure}[t!]
    \centering
    \includegraphics[width=0.8\linewidth]{figures/viz3.pdf}
    \caption{A schematic drawing of the prototype setup using a medical example. Each subsequence $h_{t}^{i}$ of the patient histories in the training data has a representation in the learned latent space $\mathcal{Z}$. A few subsequences are selected as prototypes---samples that are representative of the history-action space. In this example, there are three prototypes which are treated with different drug doses. Note that a patient can belong to different prototype clusters during the course of medication, as indicated by the arrows pointing out from the column vector. The action propensity $p_{\mu}(a \mid h_{t})$ of a test sample $h_{t}$ is computed by weighting the similarity between $h_{t}$ and each prototype.}
    \label{fig:model}
\end{figure}

% ----------------------------------------------------------
% -- PROTOTYPE LEARNING ------------------------------------
% ----------------------------------------------------------

\subsection{Modeling Behavior with Prototypes}
\label{sec:protomodel}

Let $\tH = [\th^{1}, \ldots, \th^{n}]^{\intercal}$ be a list of $n$ prototype histories.\footnote{From now on, we refer to these as ``prototypes''.} Each prototype is a \emph{subsequence} of an observed history, $\th^j = h_t^i$ for $h^i \in \cD$ and $t \leq t_i$. We allow the prototypes to be subsequences of full-length histories since OPE requires evaluating the behavior policy at each time step. The behavior policy $p_\mu(A_{t} \mid H_{t}=h_{t})$ is approximated based on the similarity between an observation $h_t$ and the prototypes in a learned representation. The prototypes $\tH$ are themselves selected by the learning algorithm.

To learn $\tH$, we follow \citet{li2018deep, ming2019prototypes} by first learning a set of latent prototypes as free parameters $\tZ = \left[\tz_{1}, \ldots, \tz_{n}\right]^{\intercal}$ in an encoding space $\cZ$. Given an encoder $e : \cH \rightarrow \cZ$, for an arbitrary history $h_{t}$, let
\begin{equation*}
    S(\tZ, e(h_{t})) = [s(\tz_{1}, e(h_{t})), \ldots, s(\tz_{n}, e(h_{t}))]^{\intercal}
\end{equation*}
be the \emph{similarity vector} for the encoding of $h_{t}$ comparing $e(h_{t})$ to $\tZ$ using a fixed function $s : \cZ \times \cZ \rightarrow \bbR_{+}$. We use an RBF-kernel with unit bandwidth ($\gamma=1$),
\begin{equation}
    s(\tz, e(h)) \coloneqq \exp(-\|\tz-e(h)\|_2^{2}/\gamma^{2}),
    \label{eq:rbf_sim}
\end{equation}
which takes values between $0$ (no similarity) and $1$ (full similarity). With $B \in \bbR^{k \times n}$, we estimate the behavior policy $\mu$ through logistic regression in the space induced by $S$, 
\begin{equation}
    \hp_\mu(A_{t} \mid H_{t}=h_t) = f_{\sigma}(BS(\tZ, e(h_t)) + c),
    \label{eq:softmax}
\end{equation}
where $f_{\sigma}$ denotes the softmax function over rows and $c \in \mathbb{R}^k$ is a bias term. Column $j$ of $B$ represents the coefficients determining the action probabilities associated with $\th^j$. If the coefficient $B_{ij}$ is positive, higher similarity between $h_{t}$ and $\th^j$ makes action $i$ more probable for $h_{t}$; a negative coefficient makes action $i$ less probable. 
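
Written out componentwise, as a restatement of \eqref{eq:softmax} with the shorthand $s_j \coloneqq s(\tz_{j}, e(h_t))$ introduced here for readability, the estimated probability of action $a$ is
\begin{equation*}
    \hp_\mu(A_{t}=a \mid H_{t}=h_t)
    =
    \frac{\exp\big(c_a + \sum_{j=1}^{n} B_{aj} s_j\big)}
    {\sum_{a'=1}^{k} \exp\big(c_{a'} + \sum_{j=1}^{n} B_{a'j} s_j\big)}.
\end{equation*}
In particular, when $h_t$ is far from every prototype in the latent space, all $s_j \approx 0$ and the prediction falls back to the softmax of the bias term $c$ alone.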

The model parameters $\Theta = (e, B, c, \tH)$, comprising the parameters of the encoder $e$, coefficients $B$, $c$ and the set of prototypes $\tH$, are all unknown and must be learned from data. As encoder, we use either feedforward or recurrent neural networks. Following \citet{ming2019prototypes}, we learn $\Theta$ by minimizing the regularized negative log-likelihood (NLL)
\begin{equation}
    J(\Theta) = \mathrm{NLL}(\cD; \Theta) + \lambda_{d}R_{d}(\Theta) + \lambda_{c}R_{c}(\Theta) + \lambda_{e}R_{e}(\Theta)
    \label{eq:objective}
\end{equation}
using stochastic gradient descent. The regularization terms $R_{d}(\Theta)$, $R_{c}(\Theta)$ and $R_{e}(\Theta)$ encourage \emph{diversity}, \emph{clustering} and \emph{evidence}, respectively, and are defined in Appendix~\appReg. 

To make sure that prototypes represent real cases, i.e., to select $\tH$, latent prototypes are projected onto encodings of training samples at regular intervals between descent steps,
\begin{equation}
    \th^j \leftarrow \argmax_{h^i_t \in \overline{\cD}} \, s(\tz_{j}, e(h^i_t))
    \;\;\; \mbox{ and } \;\;\; 
    \tz_{j} \leftarrow e(\th^j),
    \label{eq:projection}
\end{equation}
with $\overline{\cD}$ the set of all subsequences of trajectories in $\cD$.

\paragraph{Is there a good prototype model?}{Modeling the behavior policy using prototypes places additional restrictions on the functional form of estimates. It is natural to ask: Assuming that adjusting for the history $H_t$ is sufficient for unbiased policy evaluation, do there exist prototype histories $\tilde{H}$, an encoding $e$ and a similarity function $s$ such that evaluation using the prototype model is exact or accurate? In Section~\ref{sec:experiments}, we study this question empirically. Additionally, in Appendix~\appIgnorability, we show constructively that there are indeed problems for which a prototype model exists that \emph{exactly} describes the behavior policy $\mu$.}

% ----------------------------------------------------------
% -- PROTOTYPE PREDICTIONS ---------------------------------
% ----------------------------------------------------------

\subsection{Predicting with Prototypes}
\label{sec:prediction}

When computing the estimated behavior policy \eqref{eq:softmax} for a history $h$, the similarity vector $S(e(\tH), e(h))$ measures how similar each of the $n$ prototypes is to $h$. The number $n$ is a hyperparameter. The more prototypes are used, the greater the flexibility of the model, but a large $n$ may result in $S$ containing multiple elements close to $1$, making predictions difficult to interpret. For example, if $s(e(\th^j), e(h))\approx 1$ for more than 10 prototypes $j$, it may be difficult to reason about the policy decision after all.

To address this, we use only a limited number of  $q\leq n$ prototypes---so-called \emph{prediction prototypes}---when making predictions with the \textit{trained} model. Let $s_{q}(h)$ be the similarity between $e(h)$ and its $q$th most similar latent prototype. For $j=1, \ldots, n$, we truncate the similarity vector according to
\begin{equation*}
    s(\tz_{j}, e(h)) \leftarrow 
	\left\{ 
	\begin{array}{ll}
    	s(\tz_{j}, e(h)) &\text{if }\; s(\tz_{j}, e(h)) \geq s_{q}(h), \\
        0 &\text{otherwise}.
    \end{array}
    \right.
    \label{eq:predprotos}
\end{equation*}
As an example, with $q=2$ the (sorted) similarity vector in Figure \ref{fig:model} would become $[0.9, 0.6, 0]^{\intercal}$.

We perform the truncation step independently for each history $h$. In Section \ref{sec:analysis}, we study the resulting trade-off between transparency (small $q$ and $n$) and bias as we vary the number of prototypes $n$ and the number of prediction prototypes $q$. Note that we optimize the regularization parameters $\lambda_{d}$, $\lambda_{c}$ and $\lambda_{e}$ with respect to the choice of $q$.

% ----------------------------------------------------------
% -- PROTOTYPE VALUES --------------------------------------
% ----------------------------------------------------------

\subsection{Using Prototypes for Evaluation}
\label{sec:protovalues}

Prototypes induce a soft clustering of the space of histories. Each prototype represents a group of similar histories which can be associated with a certain distribution over actions. In Figure \ref{fig:model}, we see for example that the ``green prototype''---representing the patients in the ``green cluster''---is given a higher dose of the drug than the other prototypes. Given characteristics of the ``green prototype'', a domain expert should be able to explain why it receives this type of treatment. While it is possible to use other methods to cluster the space of histories, prototypes have the advantage of being grounded in real cases and trained to describe groups of subjects who are treated differently under the behavior policy. We see in Section~\ref{sec:analysis} that this is beneficial also for accuracy.

When modeling the behavior policy $\mu$ using prototypes, we can utilize the induced clustering structure to answer the questions raised in Section \ref{sec:is_problems}. First of all, the prototypes $\th^{j}$ and their action probabilities
\begin{equation}
    \hp^{j}_{\mu}(a) = \hp_\mu(A=a \mid H=\th^j)
    \label{eq:prototype_policy}
\end{equation}
give an overview of the estimated behavior policy. By comparing the action probabilities $\hp^{j}_{\mu}(a)$ with the corresponding action probabilities under $\pi$, $p_{\pi}^{j}(a) = p_\pi(A=a \mid H=\th^j)$, we can explain input regions for which $\pi$ and $\mu$ differ in their suggested actions. Domain experts can use this overview to assess how well the data supports evaluation of the target policy. For example, if $\pi$, for a certain prototype, suggests actions that are extremely rare under $\mu$, there may not be enough data on these decisions to accurately estimate $V(\pi)$.

It is good practice to compare $\hat{V}(\pi)$ with $\hat{V}(\mu)$, i.e., the mean reward in data. If $\hat{V}(\pi)$ is different from $\hat{V}(\mu)$, we would like to know for which inputs $\pi$ gains or loses performance in relation to $\mu$. The prototypes allow us to divide the estimated values into prototype-based contributions and answer this question. To make the implicit clustering explicit, we define $J_t$ to be a random variable with values in $\{1, \ldots, n\}$, representing an assignment of a history $H_t$ to prototype $j$ at time $t$. We let the probability of being assigned to prototype $j$ be proportional to the similarity $s$,
\begin{equation}
    p(J_t=j \mid H_t=h_{t}) = \frac{s(\tilde{z}_j, e(h_{t}))}{\sum_{l=1}^n s(\tilde{z}_l, e(h_{t}))}.
    \label{eq:proto_prob}
\end{equation}
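As an illustration with hypothetical similarity values (chosen for this example, not taken from the experiments), suppose $n=3$ and the similarities between $e(h_t)$ and the three latent prototypes are $(0.9, 0.6, 0.1)$. The normalizing sum is $1.6$, so
\begin{equation*}
    p(J_t=1 \mid H_t=h_t) = \frac{0.9}{1.6} \approx 0.56, \quad
    p(J_t=2 \mid H_t=h_t) = \frac{0.6}{1.6} \approx 0.38, \quad
    p(J_t=3 \mid H_t=h_t) = \frac{0.1}{1.6} \approx 0.06,
\end{equation*}
and the history is assigned mostly, but not exclusively, to the first prototype.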

Now, we define the value $V_{j,t}(\pi)$ of prototype $j$ at time $t$, obtained under a policy $\pi$, as the expected future reward under $\pi$ given the assignment $J_t = j$:
\begin{equation}
    V_{j,t}(\pi) \coloneqq \E_{\pi} [R_T \mid J_t = j ].
    \label{eq:prototype_value}
\end{equation}
With $p_\pi(J_t=j)$ the marginal probability, under $\pi$, of being assigned to prototype $j$ at time $t$, the law of total expectation gives $V(\pi) = \sum_{j=1}^n V_{j,t}(\pi) p_\pi(J_t=j)$ for any $t$. Each term
\begin{equation}
    V_{j,t}(\pi) p_\pi(J_t=j)
    \label{eq:contribution}
\end{equation}
in the sum represents the contribution to the overall value $V(\pi)$ from histories which are similar to prototype $j$ at time $t$, effectively stratifying the value by types of situations. Note that we can compute $V_{j,t}(\mu)$ in a similar way to compare the estimated values of $\pi$ and $\mu$ from a prototype perspective.

We may express $V_{j,t}(\pi)$ as a weighted expectation under the behavior policy $\mu$, with importance weights $W$ (the random-variable analogue of $w_i$ in \eqref{eq:wr_redef}, with $p_\mu$ in place of its estimate),
\begin{equation*}
    V_{j,t}(\pi) = \E_{\mu} \bigg[\frac{p(J_t=j \mid H_t)}{p_\pi(J_t=j)}  W R_T  \bigg],
\end{equation*}
where $p_\pi(J_t=j)$ is found by importance-weighted marginalization over $H_t$ (see derivation in Appendix~\appProtValue). We use this strategy to estimate $V_{j,t}(\pi)$ from finite samples.
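
A minimal sketch of one such finite-sample estimator is given below. It assumes that every trajectory in $\cD$ has length at least $t$, and it is not necessarily identical to the estimator derived in Appendix~\appProtValue; it simply replaces expectations by self-normalized weighted averages. With the shorthand $\rho^{i}_{j} \coloneqq p(J_t=j \mid H_t=h^{i}_{t})$, introduced here for illustration,
\begin{equation*}
    \hat{p}_\pi(J_t=j) = \frac{\sum_{i=1}^{m} w_{i}\rho^{i}_{j}}{\sum_{i=1}^{m} w_{i}}
    \quad \mbox{and} \quad
    \hat{V}_{j,t}(\pi) = \frac{\sum_{i=1}^{m} w_{i}\rho^{i}_{j} r^{i}}{\sum_{i=1}^{m} w_{i}\rho^{i}_{j}}.
\end{equation*}
Because $\sum_{j=1}^{n}\rho^{i}_{j} = 1$ for every $i$, the estimated contributions satisfy $\sum_{j=1}^{n} \hat{V}_{j,t}(\pi)\,\hat{p}_\pi(J_t=j) = \hat{V}_\WIS(\pi; \hat{\mu})$, so under this sketch the prototype contributions decompose the WIS estimate exactly.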

% ----------------------------------------------------------
% -- EXPERIMENTS -------------------------------------------
% ----------------------------------------------------------

\section{Experiments}
\label{sec:experiments}

In Section \ref{sec:sepsisexp}, we illustrate our method by examining an example of sepsis management. Using patient data from the MIMIC-III database~\citep{mimiciii}, we evaluate a replication of the so-called AI Clinician implemented by \citet{komorowski2018artificial}; see below for details. In Section \ref{sec:analysis}, we inspect the prototype model in more detail. We study the trade-off between transparency and bias and compare the model to several baseline estimators. By utilizing a sepsis simulator, we also investigate the bias induced by prototypes as a function of the sequence length.

\paragraph{AI Clinician.}{The AI Clinician is a clinical decision support model for sepsis management~\citep{komorowski2018artificial}. The model is learned from data of sepsis patients extracted from the MIMIC-III database~\citep{mimiciii}. The patient data (e.g., demographics, vital signs and laboratory values) are coded as multidimensional time series with a discrete time step of 4 hours. There are two treatment variables: the total volume of intravenous (IV) fluids $(f)$ and maximum dose of vasopressors $(v)$ administered over each 4-hour period. In short, the AI Clinician is built by clustering the data into 750 states, discretizing the combinations of treatment doses into 25 possible actions $(f, v) \in \{0, 1, 2, 3, 4\}^2$, and solving the corresponding Markov decision process using value iteration. The final reward $r^{i}$ is $+100$ if the patient survived and $-100$ if the patient died. This process is repeated 500 times, each time with a new train-test split, and the model with the highest WIS value estimate on the test set is taken as the target policy, $\pi_{\text{AIC}}$. We use the data split associated with $\pi_{\text{AIC}}$ in our experiments.}

\paragraph{Experimental setup.}{We consider two types of encoders for the prototype framework: a feedforward neural network (FNN) and a recurrent neural network (RNN). Both encoders have two layers of size 64 with ReLU and tanh, respectively, as activation functions. We name these models ProNet and ProSeNet, respectively. We compare the prototype framework to several baseline models: a logistic regression classifier (LR), a random forest classifier (RF), a vanilla FNN, a vanilla RNN, and a model based on post-hoc clustering of RNN-encoded histories. The neural network baselines have the same structure as the corresponding prototype encoder. In the main sepsis experiment, we train all neural networks for 400 epochs, using a batch size of 64 for RNN and ProSeNet, and 1024 for FNN and ProNet. We use the Adam algorithm for optimization with learning rate 0.001, weight decay 0.001 and otherwise default parameters. Furthermore, we use the NLL loss when training the vanilla neural networks. All models are calibrated using sigmoid calibration on a held-out validation set (\SI{25}{\percent} of the training data). Further details, including hyperparameter selection, are provided in Appendix \appExpDetails.\footnote{The code is available at \url{https://github.com/Healthy-AI/case_based_ope}.}}

% ----------------------------------------------------------
% -- POLICY EVALUATION -------------------------------------
% ----------------------------------------------------------

\subsection{Demonstrating the Framework}
\label{sec:sepsisexp}

To demonstrate the benefit of learning prototypes in OPE, we estimate the behavior policy $\mu$, i.e., the policy followed by clinicians in the MIMIC-III data, using a prototype model with $n=10$ prototypes, $q=2$ prediction prototypes and an RNN encoder (i.e., a ProSeNet model). As an overview of the relationship between prototypes, a PCA plot of encoded training data is shown in Figure \ref{fig:sepsis:pca_action}. The latent prototypes are numbered 1--10 and the colors indicate the action chosen by $\mu$ in the data. Note that the figure is intended to help readers orient themselves in this section; we do not rely on this projection in itself.

\begin{figure}[t!]
    \centering
    \includegraphics[width=0.9\linewidth]{figures/prosenet_10_2_5_pca.pdf}
    \caption{A PCA plot of encoded training data, colored by the action (dose of IV fluids and vasopressors) taken by the physicians. The prototypes are numbered 1--10. Note that prototype learning affects the structure of the encoding space. Post-hoc clustering of the latent space of a model trained without prototypes gives a substantially worse approximation of the behavior policy; see Table~\ref{tab:sepsis:performance}.}
    \label{fig:sepsis:pca_action}
\end{figure}

We can interpret the prototypes by visualizing trajectories of the corresponding patients. In Figure \ref{fig:sepsis:features_actions}, we take a closer look at prototypes 5, 7 and 8, which represent each of the major clusters in Figure \ref{fig:sepsis:pca_action}. By plotting three key features---heart rate (HR), mean blood pressure (BP) and SOFA score---and the treatment variables against time, we immediately get a sense of which type of patients the prototypes represent. For example, the patient corresponding to prototype 5 has high heart rate, low blood pressure and high SOFA score---signs of severe sepsis---and receives an aggressive treatment.\footnote{SOFA is an abbreviation for sequential organ failure assessment.} The prototype 7 patient, who has lower heart rate, higher blood pressure and lower SOFA score, receives low doses of IV fluids and vasopressors.

\begin{figure}[t!]
    \centering
    \includegraphics[width=0.75\columnwidth]{figures/prosenet_10_2_5_features_actions_578.pdf}
    \caption{Vital signs and SOFA score plotted against time for three different prototype patients (upper three panels). The dashed black lines show the data average of each feature and the shaded areas mark $\pm 3$ standard deviations. The lower two panels show the actions taken by physicians. The time index of each prototype subsequence is marked with a filled marker; for example, prototype 5 is the subsequence ending at time step 4 of the corresponding patient history.}
    \label{fig:sepsis:features_actions}
\end{figure}

We see that the trajectories most similar to prototype 7 in the latent space belong to patients who are currently relatively healthy, which is more likely at an early stage of the course. Interestingly, we observe that all encoded histories until time $t=0$ are most similar to either prototype 7 or prototype 9. It is therefore relevant to ask: If we were to follow the AI Clinician instead of the behavior policy, how would the treatment strategy change at the initial stage? That is, how do $\mu$ and $\pi_{\text{AIC}}$ differ for prototypes 7 and 9?

We can answer this question by comparing the distributions of actions taken under $\mu$ and $\pi_{\text{AIC}}$ for these prototypes; see Figure \ref{fig:sepsis:probas}. For prototype 7, we see that the most likely treatment under both $\mu$ and $\pi_{\text{AIC}}$ is to not give any IV fluids or vasopressors. However, under $\pi_{\text{AIC}}$, there is also a relatively high probability of increasing the dose of IV fluids---a rare action under $\mu$. The differences are even greater for prototype 9, where $\pi_{\text{AIC}}$ has a nonzero probability of giving an aggressive treatment with combinations of IV fluids and vasopressors. Under $\mu$, these treatments have almost zero probability. A domain expert can reason about the validity of $\pi_{\text{AIC}}$: given characteristics of the patient corresponding to prototype 9, would it be medically sound to treat this type of patient as suggested by $\pi_{\text{AIC}}$ in Figure \ref{fig:sepsis:probas}?

\begin{figure}[t!]
    \centering
    \includegraphics[width=0.9\columnwidth]{figures/probas_79.pdf}
    \caption{Action distributions under $\mu$ and $\pi_{\text{AIC}}$ for prototypes 7 and 9. For $\mu$, the probabilities are computed according to \eqref{eq:prototype_policy}. Since $\pi_{\text{AIC}}$ is deterministic, we normalize the distribution of actions suggested by $\pi_{\text{AIC}}$ for input histories that are most similar to the respective prototype in the latent space.}
    \label{fig:sepsis:probas}
\end{figure}

From an OPE perspective, the initial differences between $\mu$ and $\pi_{\text{AIC}}$ make it difficult to accurately estimate $V(\pi_{\text{AIC}})$. A known problem with importance sampling is that the variance of the weights $w_{i}$ can grow exponentially with the sequence length~\citep{liu2018breaking}. Here, the assumption of overlap may be partially violated already at the first time step, and regardless of which model is used for the behavior policy, we observe high variance in the IS weights (ranging from $\ll{1}$ to the order of $10^{3}$) and an extremely small effective sample size ($<10$). To reduce variance, we instead evaluate a different target policy where we follow $\pi_{\text{AIC}}$ in the first time step and then follow $\mu$ until the end of the sequence. For comparison, we do the same with a zero-drug policy $\pi_{0}$, which suggests leaving patients untreated.
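
To sketch why this modification reduces variance, assume that the weight of trajectory $i$ of length $T_i$ factorizes over time steps; the symbols $\pi'$, $T_i$ and $m$ below are introduced only for this illustration and may differ from the notation used elsewhere. Let $\pi'$ denote the policy that follows $\pi_{\text{AIC}}$ at $t=0$ and $\mu$ afterwards. All factors after the first then cancel:
\[
    w_i = \prod_{t=0}^{T_i-1} \frac{p_{\pi'}(a_t^i \mid h_t^i)}{p_{\mu}(a_t^i \mid h_t^i)}
        = \frac{p_{\pi_{\text{AIC}}}(a_0^i \mid h_0^i)}{p_{\mu}(a_0^i \mid h_0^i)},
\]
so the weight no longer grows with the sequence length. In practice, $\mu$ is replaced by its estimate, and the cancellation holds exactly only if the same estimate is used in both numerator and denominator for $t \geq 1$. A commonly used definition of the effective sample size for $m$ trajectories, which we assume is the one meant above, is
\[
    m_{\text{eff}} = \frac{\bigl(\sum_{i=1}^{m} w_i\bigr)^2}{\sum_{i=1}^{m} w_i^2},
\]
which equals $m$ when all weights are equal and approaches 1 when a single weight dominates.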

\begin{figure}[t!]
    \centering
    \includegraphics[width=0.85\linewidth]{figures/policy_values.pdf}
    \caption{Bootstrap estimates of the values of the target policies that follow the AI Clinician and the zero-drug policy, respectively, for one time step and then follow the behavior policy. The estimated value of the behavior policy $\mu$, $\hat{V}(\mu)$, is included as a reference.}
    \label{fig:sepsis:wis}
\end{figure}

Figure \ref{fig:sepsis:wis} shows 100 bootstrap estimates of the policy values using five different estimators of $\mu$: the baselines LR, RF, FNN and RNN, and our prototype model. For all models except the RNN baseline and the prototype model, we make the Markov assumption and model $p(A \mid H)$ using only the last context-action pair of the history. We note that the results are consistent across estimators. In comparison with the estimated value of $\mu$, $\hat{V}(\mu)$, the results indicate that it could be beneficial to avoid giving drugs to patients at the initial time step, while it seems less favorable to follow $\pi_{\text{AIC}}$.

\begin{figure}[t!]
    \centering
    \includegraphics[width=0.9\columnwidth]{figures/VtpJt_0.pdf}
    \caption{Bootstrap estimates of $V_{j,0}(\pi)p(J_0=j)$ for prototypes 1, 7 and 9. As described in Section \ref{sec:protovalues}, $V_{j,t}(\pi)p(J_t=j)$ is the value of prototype $j$ at time $t$ multiplied by the marginal probability of being assigned to prototype $j$ at time $t$.}
    \label{fig:sepsis:VtpJt}
\end{figure}

The prototypes allow us to break down the result and answer the question: If $\hat{V}(\pi) \neq \hat{V}(\mu)$, where does $\pi$ gain or lose performance? In Figure \ref{fig:sepsis:VtpJt}, we show 100 bootstrap estimates of the contribution to the overall value, see \eqref{eq:contribution}, for prototypes 1, 7 and 9 at time $t=0$. These prototypes define the bottom right cluster of Figure \ref{fig:sepsis:pca_action} where all encoded histories until time $t=0$ belong. As expected, prototypes 7 and 9 contribute the most to the overall value. Note that the difference in variance between these prototypes is explained by Figure \ref{fig:sepsis:probas}, where $\pi_{\text{AIC}}$ and $\mu$ differ more for prototype 9 than for prototype 7. Also note that the trend in Figure \ref{fig:sepsis:wis} is repeated here: following the zero-drug policy at the initial time step is generally better than following the AI Clinician.

% ----------------------------------------------------------
% -- MODEL PERFORMANCE -------------------------------------
% ----------------------------------------------------------

\subsection{Performance of the Prototype Model} 
\label{sec:analysis}

While increasing transparency, the use of prototypes imposes restrictions on the model, possibly increasing the approximation error. In Figure \ref{fig:sepsis:pronet_prosenet}, we show the accuracy of two different prototype models, ProNet and ProSeNet, in approximating $p_\mu(A\mid H_t)$ on the sepsis test data for a varying number of prototypes $n$ and prediction prototypes $q$ (see Section \ref{sec:prediction}). As the encoder, the models use an FNN (ProNet) and an RNN (ProSeNet), respectively. The sequential model, making use of the entire history $H_t$, performs the best, especially for $q=1$ and $q \geq 4$. Interestingly, the effect of increasing the number of prototypes from 10 to 50 or even 100 is small. Using only two prediction prototypes works well for this dataset.

\begin{figure}[t!]
    \centering
    \includegraphics[width=\linewidth]{figures/pronet_prosenet_acc.pdf}
    \caption{Accuracy on the sepsis test data for ProNet and ProSeNet using a varying number of prototypes $n$ and prediction prototypes $q$. The setting with $n=10$ and $q=2$ works well here.}
    \label{fig:sepsis:pronet_prosenet}
\end{figure}

\begin{figure}[t!]
    \centering
    \includegraphics[width=\linewidth]{figures/length_bias.pdf}
    \caption{The relative error of estimated importance weights for increasing sequence lengths in the sepsis simulator. Ideally, if $\hat{\mu}$ is a good estimator of $\mu$, the ratio $w_{\hat{\mu}}/w_{\mu}$ should be equal to 1. We see that approximating $\mu$ with prototype models gives rise to a bias in relation to modeling $\mu$ with a plain FNN.}
    \label{fig:sepsis:bias}
\end{figure}

In Table \ref{tab:sepsis:performance}, we compare the prototype models with $n=10$ and $q=2$ to the baseline estimators in approximating $p_\mu(A\mid H_t)$. Here, we report accuracy, the area under the ROC curve (AUC) and the static calibration error (SCE)~\citep{nixon2019measuring}, a multiclass extension of the expected calibration error. The prototype models are superior to the (regularized) LR but they perform slightly worse than the black-box models RF, FNN and RNN. However, as we see in Figure \ref{fig:sepsis:pronet_prosenet}, with an increased number of prototypes, ProSeNet has the capacity to approach the performance of these models, at least in terms of accuracy. Finally, we note that the prototype models are superior to a model where post-hoc clustering of the RNN encodings is used to identify ``prototypes'', showing the power of learning prototypes in a supervised manner.
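
For readers unfamiliar with the SCE, the following recalls its form as we understand it from \citet{nixon2019measuring}; the symbols $K$, $B$, $N$, $n_{bk}$, $\mathrm{acc}$ and $\mathrm{conf}$ are introduced here for illustration only. With $K$ classes (here, actions), $B$ confidence bins and $N$ samples,
\[
    \mathrm{SCE} = \frac{1}{K} \sum_{k=1}^{K} \sum_{b=1}^{B} \frac{n_{bk}}{N} \, \bigl| \mathrm{acc}(b,k) - \mathrm{conf}(b,k) \bigr|,
\]
where $n_{bk}$ is the number of samples whose predicted probability for class $k$ falls in bin $b$, $\mathrm{acc}(b,k)$ is the observed frequency of class $k$ among those samples and $\mathrm{conf}(b,k)$ is their mean predicted probability for class $k$. Because the error is averaged over all classes, including rarely taken actions, its values can be small in absolute terms, so differences between models are best read relative to each other.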

In practice, the trade-off between transparency and bias is likely less of a problem. In the process of diagnosing policy value estimates, we may sacrifice some accuracy in favor of interpretability. That is, we can learn a model with few prototypes to reason about the target and behavior policies. Then, if the initial analysis indicates that the data supports evaluation of the target policy, we can learn a more complex model for the actual policy evaluation. 

\begin{table*}[t!]
    \caption{A summary of performance on the sepsis test data for different estimators of the behavior policy $p_\mu(A\mid H_t)$. For ProNet and ProSeNet, $n=10$ and $q=2$. The 95 percent confidence intervals are calculated from 1000 bootstraps.}
    \centering
    \begin{tabular}{@{}llll@{}} \toprule
    Model                       & Accuracy ($\uparrow$)     & SCE ($\downarrow$)        & AUC ($\uparrow$)  \\ \midrule
    LR                          & 0.38 (0.38, 0.39)         & 0.0112 (0.0110, 0.0115)   & 0.88 (0.88, 0.88) \\
    RF                          & 0.62 (0.61, 0.62)         & 0.0037 (0.0034, 0.0039)   & 0.93 (0.93, 0.93) \\
    Post-hoc clustering         & 0.44 (0.44, 0.45)         & 0.0097 (0.0096, 0.0101)   & 0.86 (0.85, 0.86) \\
    FNN                         & 0.61 (0.61, 0.61)         & 0.0041 (0.0039, 0.0044)   & 0.93 (0.92, 0.93) \\
    ProNet ($n=10$, $q=2$)      & 0.56 (0.55, 0.56)         & 0.0069 (0.0067, 0.0072)   & 0.90 (0.90, 0.90) \\
    RNN                         & 0.62 (0.62, 0.63)         & 0.0056 (0.0053, 0.0058)   & 0.94 (0.94, 0.94) \\
    ProSeNet ($n=10$, $q=2$)    & 0.57 (0.57, 0.58)         & 0.0057 (0.0054, 0.0059)   & 0.91 (0.91, 0.91) \\
    \bottomrule
    \end{tabular}
    \label{tab:sepsis:performance}
\end{table*}

% ----------------------------------------------------------
% -- MODEL BIAS --------------------------------------------
% ----------------------------------------------------------

\subsubsection{Bias Due to Increased Sequence Length}
\label{sec:simulator}

If the use of prototypes introduces a bias in the estimated propensity, it is natural to ask what it means for the sequential setting, where multiple propensities are multiplied together to form the importance weights. To quantify this effect, we consider the synthetic environment of sepsis management provided by~\citet{oberst2019counterfactual}. By sampling a large amount of data from the environment, we estimate the true parameters of the underlying Markov decision process. We then learn an optimal behavior policy using policy iteration. We refer to Appendix~\appSepsisSim~for details.

We collect trajectories of the behavior policy of various lengths, from 5 to 30 time steps. For each trajectory length, we use the data to (a) estimate the behavior policy $\mu$ using a vanilla FNN and FNN-based prototype models with a varying number of (prediction) prototypes, respectively; (b) learn a target policy $\pi$ using policy iteration; and (c) estimate the value of $\pi$ using both the true behavior policy $\mu$ and its estimators $\hat{\mu}$. Note that any difference in the estimated values stems from a difference in the importance weights $w$. In Figure \ref{fig:sepsis:bias}, we plot the relative error of estimated weights against the trajectory length for four estimators of $\mu$.\footnote{Note that the probabilities under $\pi$ cancel when considering the ratio of the weights.} Averaging over 100 iterations of the sampling and learning process, we see that the ratio $w_{\hat{\mu}}/w_{\mu}$ generally differs from 1 for all estimators and that modeling with prototypes induces larger bias than using an FNN. For longer sequences, the number of prediction prototypes $q$ becomes critical.
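
To spell out the cancellation noted in the footnote, consider a single trajectory of length $T$ and assume that the weights factorize over time steps (the indexing below is illustrative):
\[
    \frac{w_{\hat{\mu}}}{w_{\mu}}
    = \prod_{t=0}^{T-1} \frac{p_{\pi}(a_t \mid h_t)}{p_{\hat{\mu}}(a_t \mid h_t)}
      \Bigg/ \prod_{t=0}^{T-1} \frac{p_{\pi}(a_t \mid h_t)}{p_{\mu}(a_t \mid h_t)}
    = \prod_{t=0}^{T-1} \frac{p_{\mu}(a_t \mid h_t)}{p_{\hat{\mu}}(a_t \mid h_t)}.
\]
The ratio therefore depends only on the per-step propensity errors, which compound multiplicatively. As a purely hypothetical example, if $\hat{\mu}$ overestimated the true propensity by a constant relative factor $1+\epsilon$ at every step, the ratio would be $(1+\epsilon)^{-T}$; with $\epsilon = 0.05$ and $T = 30$ this gives $1.05^{-30} \approx 0.23$, so a small per-step error can change the weight by a factor of more than four.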

Finally, we quantify the absolute effect prototypes have on the value estimate for sequences of length 15, which is close to the average sequence length in the MIMIC-III data. We compute the true value of $\pi$ by running it in the simulator and compare this value to weighted IS estimates using the estimators in Figure \ref{fig:sepsis:bias}. We observe final rewards $r^{i}=\pm 1$ if a simulated patient is discharged or dies; otherwise $r^{i}=0$. Averaging over 100 iterations, the estimated value has an absolute difference from the true value that amounts to 0.39 for the FNN (standard deviation 0.27), 0.46 (0.33) for ProNet with $n=10$ and $q=2$, 0.50 (0.34) for ProNet with $n=10$ and $q=5$ and 0.41 (0.31) for ProNet with $n=100$ and $q=5$. These results should be compared to the average value of $\pi$ (0.06 (0.10)) and the WIS estimate using the true behavior policy (0.38 (0.27)).
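
To put these errors in context, it may help to recall the form of the weighted IS estimate. The expression below is a sketch that assumes undiscounted returns equal to the final reward $r^{i}$ and uses $m$ for the number of trajectories, which is notation introduced only here:
\[
    \hat{V}_{\mathrm{WIS}}(\pi) = \frac{\sum_{i=1}^{m} w_i \, r^{i}}{\sum_{i=1}^{m} w_i}.
\]
Since the weights are nonnegative, the estimate is a weighted average of rewards in $\{-1, 0, 1\}$ and thus lies in $[-1, 1]$. When a few weights dominate the sum, the estimate is pulled toward the rewards of the corresponding trajectories, which is one way in which errors of the reported magnitude can arise even when the true behavior policy is used.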

% ----------------------------------------------------------
% -- RELATED WORK ------------------------------------------
% ----------------------------------------------------------

\section{Related Work}

% High variance of IS and related issues
Issues with importance sampling methods for OPE are well known. Several works aim at describing issues related to high variance~\citep{gottesman2019guidelines}, or mitigating them using methodological advances~\citep{precup2000eligibility,thomas2016data,jiang2016doubly,schneeweiss2009high,swaminathan2015counterfactual}. Others aim to use weights to identify a new study population for which the policy's value can be more efficiently estimated~\citep{li2018balancing,fogarty2016discrete}. \citet{oberst2020characterization} emphasize the value of interpretability in this endeavor to communicate the generalizability of the estimate. Our method is compatible with all three approaches, allowing for transparent descriptions of variance issues, identifying new study populations and for use as plug-in estimates.

% Interpretability
Interpretability is an important component of learning systems deployed in increasingly critical functions~\citep{rudin2019stop,lipton2018mythos}. Rule-based estimators, such as rule lists~\citep{wang2015falling} and decision trees, are often favored for their short descriptions but generalize poorly to sequential inputs, which are the focus of this work. \citet{gottesman2020interpretable} proposed an approach for interpretable OPE which highlights transitions in data whose removal would have a large impact on the estimate. This approach is related to ours but answers a different set of questions.

% Matching
Evaluating policies using direct sample-to-sample comparison has a long tradition in policy evaluation through the use of matching estimators of causal effects~\citep[see, e.g.,][]{rosenbaum1983central,rubin2006matched,kallus2020generalized}. While favored for its transparency, this approach is typically only used to compare two deterministic policies such as ``treat all'' or ``treat none''. Matching often relies either on specifying a similarity function in advance or on an estimate of the behavior policy. In high-dimensional settings, this often leads to bias or lost interpretability. Our approach aims to combine the transparency of matching estimators with the flexibility of representation learning methods.

% ----------------------------------------------------------
% -- CONCLUSION --------------------------------------------
% ----------------------------------------------------------

\section{Conclusion}

In this work, we have studied off-policy evaluation (OPE) using importance sampling (IS) in the case where the behavior policy $\mu$ is unknown and must be estimated from data. While IS is a popular OPE method, it may be difficult to assess the quality of an IS value estimate. Standard practices, such as inspecting importance weights, provide only an average or a per-sample view of potential issues. To address this issue, we proposed estimating the behavior policy for IS using prototype learning to better explain patterns in policy decisions and value estimates. We demonstrated our idea using a real-world example of sepsis management. While the use of prototypes increases the approximation error, we found that prototype models have the capacity to perform similarly to plain neural networks. 

When reflecting upon the results of the sepsis study it may seem strange that it would be advantageous to never treat patients at the onset of sepsis. Even though prototypes serve as a tool for inspecting policies and value estimates, they do not answer all questions. For example, there may be variables affecting both the treatment and the outcome that are not present in the observed data. In such a case, the ignorability assumption defined in Section \ref{sec:ope_def} no longer holds. Failure to detect such an issue is not a limitation of prototypes; ignorability cannot be verified by statistical means. 

A limitation of our analysis is that it does not separate different types of errors introduced by prototype learning. We conjecture that a model with fewer prototypes is more likely to overestimate overlap between behavior and target policies, rather than underestimate it, due to increased smoothness in the estimated behavior policy. This will likely result in less extreme importance weights and lower variance, potentially at the cost of increased bias. We hope to provide analysis which more precisely characterizes the approximation error as a function of the number of prototypes in future work.

Finally, we have used subsequences of all variables in patient histories as prototypes. This choice is aligned with the literature on sequence prototypes but is not the only option. For example, in applications of prototype learning to image classification, parts of images were used as prototypes~\citep{li2018deep}, not entire images from the training set. To simplify description further and improve interpretability in policy evaluation, we may define prototypes as sequences of only variables which are important for the behavior policy.

% ----------------------------------------------------------
% -- ACKNOWLEDGEMENTS --------------------------------------
% ----------------------------------------------------------

\begin{acknowledgements}
We would like to thank Patrick Royer for insightful discussions regarding the sepsis management policy evaluation. Furthermore, we would like to thank  Devdatt Dubhashi, Emil Carlsson, Morteza Haghir Chehreghani and Adam Breitholtz for valuable feedback on this work. 

This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. 

The computations in this work were enabled by resources provided by the Swedish National Infrastructure for Computing (SNIC) at Chalmers Centre for Computational Science and Engineering (C3SE) partially funded by the Swedish Research Council through grant agreement no. 2018-05973.
\end{acknowledgements}

% ----------------------------------------------------------
% -- BIBLIOGRAPHY ------------------------------------------
% ----------------------------------------------------------

\bibliography{matsson_289}

\end{document}
