\documentclass[accepted]{uai2022} % for initial submission
%\documentclass[accepted]{uai2021} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2021} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2021} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)
%\usepackage{times}  % DO NOT CHANGE THIS
%\usepackage{helvet} % DO NOT CHANGE THIS
%\usepackage{courier}  % DO NOT CHANGE THIS
%\usepackage[hyphens]{url}  % DO NOT CHANGE THIS
%\usepackage{graphicx} % DO NOT CHANGE THIS
%\urlstyle{rm} % DO NOT CHANGE THIS
%\def\UrlFont{\rm}  % DO NOT CHANGE THIS
%\usepackage{natbib}  % DO NOT CHANGE THIS AND DO NOT ADD ANY OPTIONS TO IT
%\usepackage{caption} % DO NOT CHANGE THIS AND DO NOT ADD ANY OPTIONS TO IT

\usepackage{microtype}
\usepackage{graphicx}
\usepackage{subfigure}


% hyperref makes hyperlinks in the resulting PDF.
% If your build breaks (sometimes temporarily if a hyperlink spans a page)
% please comment out the following usepackage line and replace
% \usepackage{icml2021} with \usepackage[nohyperref]{icml2021} above.
\usepackage{hyperref}
\usepackage{algorithmic}
\usepackage{algorithm}
% Attempt to make hyperref and algorithmic work together better:
%\newcommand{\theHalgorithm}{\arabic{algorithm}}


\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}  % use 8-bit T1 fonts
%\usepackage{hyperref}    % hyperlinks
\usepackage{url}      % simple URL typesetting

\usepackage{amsfonts}    % blackboard math symbols
\usepackage{nicefrac}    % compact symbols for 1/2, etc.

\usepackage{shortcuts}
\usepackage{collcell}
\usepackage{amsthm}
\usepackage{amsmath}
\usepackage{amssymb}
\theoremstyle{plain}
\newtheorem{definition}{Definition}
\newtheorem{proposition}{Proposition}
\newtheorem{assumption}{Assumption}
\newtheorem{remark}{Remark}
\newtheorem{lemma}{Lemma}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}{Corollary}
%\usepackage{wrapfig}
\usepackage{url}
\usepackage{color, colortbl}
\definecolor{Gray}{gray}{0.9}
%\usepackage{isomath}
%Following defines command for two vertical bars used for KL-divergence
\usepackage{mathtools}
\usepackage[T1]{fontenc}
%\usepackage{lmodern}
\usepackage{multirow}
\usepackage{mathabx}
\usepackage{dsfont}
\usepackage[
colorinlistoftodos,
%  disable,
textsize=footnotesize,
]{todonotes} 
\newcommand{\KA}[1]{{\color{blue}#1}} 
\usepackage[%
%capitalize,
sort&compress
]{cleveref}
\Crefname{assumption}{Assumption}{Assumption}

\usepackage{wrapfig}
\usepackage{xspace}
\newcommand{\MMCCP}{\mbox{MMCCP}\xspace}

\hypersetup{
    colorlinks = true,
    citecolor  = blue,
    linkcolor  = blue
}



%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Generalizing Off-Policy Learning under Sample Selection Bias}

% The standard author block has changed for UAI 2021 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
% \href{mailto:<thatt@ethz.ch>?Subject=GenPl}{
\author[1]{Tobias Hatt}
\author[1]{Daniel Tschernutter}
\author[1,2]{Stefan Feuerriegel}
%\author[3]{Further~Coauthor}
%\author[1]{Further~Coauthor}
%author[3]{Further~Coauthor}
%\author[3,1]{Further~Coauthor}
% Add affiliations after the authors
\affil[1]{%
    ETH Zurich\\
    Switzerland
}
\affil[2]{%
    LMU Munich\\
    Germany
}
  
\begin{document}
\maketitle

\begin{abstract}
	Learning personalized decision policies that generalize to the target population is of great relevance. Since training data is often not representative of the target population, standard policy learning methods may yield policies that do not generalize to the target population. To address this challenge, we propose a novel framework for learning policies that generalize to the target population. For this, we characterize the difference between the training data and the target population as a sample selection bias using a selection variable. Over an uncertainty set around this selection variable, we optimize the minimax value of a policy to achieve the best worst-case policy value on the target population. In order to solve the minimax problem, we derive an efficient algorithm based on a convex-concave procedure and prove convergence for parametrized spaces of policies such as logistic policies. We prove that, if the uncertainty set is well-specified, our policies generalize to the target population as they cannot do worse than on the training data. Using simulated data and a clinical trial, we demonstrate that, compared to standard policy learning methods, our framework improves the generalizability of policies substantially.
\end{abstract}



\sloppy
\raggedbottom

\section{Introduction}
\label{introduction}

%General motviation of individual-level decision policies
Learning personalized policies has become integral to modern decision-making in a variety of domains such as medicine \citep{Hill2013} and public policy \citep{kube2019allocating}. Since exploration in these domains is costly or otherwise infeasible, many methods have been proposed for off-policy learning, \ie, policy learning from existing data \citep[\eg,][]{dudik2014doubly, kallus2018balanced, athey2017efficient, tschernutter2022interpretable}.
%A prominent application of these methods is seen in personalized medicine using electronic health records \citep{kosorok2019precision}.

%Generalizability: observed data is not representative for target population
%possibly biased sample of the population of interest (henceforth, the target population) 

%A major challenge in policy learning is generalizability \citep{levine2020offline}. %This key assumption is fundamentally unverifiable \TODO{ref} and, if not fulfilled, polices learned under this assumption may not be effective when implemented in the target population \citep{zhao2019robustifying}. In order to learn policies that generalize, 
%A major challenge in off-policy learning is the generalizability of policies, which is concerned with whether a learned policy can be implemented in the target population. Standard methods for policy learning assume that the data used for training (\ie, training data) is representative of the target population \citep[\eg,][]{beygelzimer2009offset, dudik2014doubly, kallus2018balanced}. Unfortunately, training data is often not representative of the target population, bringing into question the generalizability of policies learned with standard methods \citep{willan2004regression, cole2010generalizing, buchanan2018generalizing, rothwell2005external, stuart2011use, downs1998feasibility, norris2001effectiveness}. For instance, a review of the eligibility criteria of 20 AIDS Clinical Trial Group studies found that 28\% to 68\% of women living with human immunodeficiency virus (HIV) and women at risk for HIV infection in the USA would have been excluded from these studies \citep{gandhi2005eligibility}. As a consequence, subjects in these studies are not representative of the population of HIV-positive patients in the USA (\ie, the target population) and any evidence derived from these studies may not be effective. Hence, the generalizability of policies is crucial for making decisions that are relevant in practice.



A major challenge in off-policy learning is the generalizability of policies. Generalizability is concerned with whether a policy learned on the data for training (\ie, training data) is also effective in the target population. Standard methods for policy learning yield policies that are effective on the target population, if, and only if, the training data is representative of the target population \citep[\eg,][]{beygelzimer2009offset, dudik2014doubly}. However, this may not hold true in practice \citep[\eg,][]{buchanan2018generalizing, cole2010generalizing, downs1998feasibility,flores2021assessment, norris2001effectiveness,rothwell2005external}. For instance, a review of HIV/AIDS clinical trials found that women are largely underrepresented in these trials \citep{gandhi2005eligibility, greenblatt2011priority}, so that data from these trials is not representative of the actual target population (\ie, the population of HIV-positive patients in the USA). Therefore, when data from such trials is used to derive policies, policies learned with standard methods may not generalize to the target population. As such, these policies may be ineffective or even harmful on the target population and, therefore, not relevant in practice.

%For instance, the Women’s Interagency HIV Study (WIHS) is a prospective, observational, and multi-center study, which is considered to be representative of women living with human immunodeficiency virus (HIV) and women at risk for HIV infection in the US \citep{bacon2005women}. 



%In this paper, ...
%do so by recognizing that the observed data may not representative of the target population and
In this paper, we develop a framework for learning policies from training data that generalize to the target population.\footnote{Code available at \url{github.com/tobhatt/GeneralOPL}.} For this, we characterize the difference between training data and target population as a sample selection bias using an unknown selection variable \citep[\eg,][]{cortes2008sample, manski1989anatomy}. If we had oracle access to the true selection variable, we could re-weight the data accordingly in order to obtain the value of a policy on the target population. Since, in practice, the true selection variable is unknown, the value of a policy on the target population is not identifiable from training data. Instead, we derive bounds on the odds-ratio of the selection probability, which yields an uncertainty set around the true selection probabilities. Then, our framework optimizes the minimax value of a policy to achieve the best worst-case policy value on the target population. We prove that, if the uncertainty set is well-specified, our framework yields policies that do not do worse on the target population than the worst-case policy value estimated from the training data. As such, these policies can generalize to the target population. In order to efficiently optimize the minimax value of a policy, we show that it can be written as a difference of convex functions~(DC) program. Then, by leveraging the structure of the adversarial subproblem, we develop a tailored \textbf{m}ini\textbf{m}ax \textbf{c}onvex-\textbf{c}oncave \textbf{p}rocedure~(\MMCCP). We prove that \MMCCP converges for certain parameterized spaces of policies such as logistic policies. Using synthetic data and a clinical trial, we demonstrate that standard policy learning methods generalize poorly, while our framework improves the generalizability of policies substantially. As such, our framework enables learning reliable policies that can be implemented in the target population.
%If we were given samples of the selection variable, we could estimate individual's probability of being selected into the training data and re-weight the data accordingly in order to obtain the policy value on the target population.

%Summary of contribution
%We summarize our \textbf{contributions}\footnote{Code available at \url{github.com/anonymous/GeneralOPL} (link anonymized for peer-review; code for review in the supplements.)} as follows:
%\begin{enumerate}
%	\vspace{-0.5em}
%	\item We formulate the problem of learning policies that generalize to the target population as a sample selection bias.
%	\vspace{-.5em}
%	\item We introduce a framework for generalizable off-policy learning, which 
%	optimizes the minimax policy value over an uncertainty set around the selection probabilities.
%	\vspace{-.5em}
%	\item We develop a tailored \textbf{m}ini\textbf{m}ax \textbf{c}onvex-\textbf{c}oncave \textbf{p}rocedure~(\MMCCP) to efficiently optimize the minimax policy value and prove that \MMCCP converges for certain parametric policies. 
%\end{enumerate}

%We define generalizability as the extension of inferences
%from the trial to a target population that coincides, or is a
%subset of, the trial-eligible population (Perils and potentials of self-selected
%entry to epidemiological studies and surveys; Extending inferences from a randomized trial to a target population).
%
%We define transportability
%as the extension of inferences from the trial to
%a target population that includes individuals who are not
%part of the trial-eligible population (others (Transportability
%of trial results using inverse odds of sampling weights) have proposed
%different definitions). In this context, we collectively
%refer to generalizability and transportability as extending
%inferences from trial participants to a target population.

\section{Preliminaries}
In this section, we describe the setup, formulate the problem of generalizing policies, and discuss related work.

\subsection{Setup}
\label{problem_setup}
%We consider the random variables $(X, T, Y) \sim \mathbb{P}$, which consists of covariates $X\in \cl X \subseteq \mathbb R^d$, the treatment assignment $T\in\{0, 1\}$, and the outcome $Y\in\Rl$. 
We consider a binary treatment $T\in\{0, 1\}$, covariates $X\in \cl X \subseteq \mathbb R^d$, and the outcome $Y\in\Rl$. We use the convention that lower outcomes are preferred. Using the Neyman-Rubin potential outcomes framework \citep{Rubin2005}, let $Y(0), Y(1)$ be the potential outcomes for each of the treatments. Further, let a \emph{policy} $\pi$ be a map from the covariates to the probability of treatment assignment, \ie, $\pi: \cl X \rightarrow [0,1]$. Then, the \emph{policy value} of $\pi$ is given by 
\begin{align}\label{eq:policy_value}
	\begin{split}
		V(\pi) &= \mathbb{E}[Y^\pi] = \mathbb{E}[\pi(X)Y(1) + (1-\pi(X))Y(0)].
	\end{split}
\end{align}%under the distribution $\mathbb{P}$ 
The objective of \emph{policy learning} is to find a policy $\overline{\pi}$ in a policy class $\Pi$ that minimizes the policy value, \ie, \mbox{$\overline{\pi} \in \argmin_{\pi\in\Pi}{V(\pi)}$}.

%Standard assumptions for causality
We make the following three standard assumptions \citep{RubinD.B1974}: (i)~consistency (\ie, \mbox{$Y=Y(T)$}); (ii)~positivity (\ie, \mbox{$0<p(T=1 \mid X=x)<1$} for all $x$); and (iii)~strong ignorability (\ie, \mbox{$Y(0), Y(1) \ci T \mid X$}). Then we can identify the policy value in \labelcref{eq:policy_value} in terms of the observed data $(X, T, Y)$.%\footnote{Note that \labelcref{eq:policy_value} abuses notation slightly, since $Y(1), Y(0)$ are never observed and, therefore, not included in the observed data $(X, T, Y) \sim \mathbb{P}$. However, due to strong ignorability, \labelcref{eq:policy_value} can be written in terms of the observed data.}
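
As a brief sketch of how the three assumptions enter (a standard argument, stated here only for illustration), the conditional mean of each potential outcome can be rewritten for $t\in\{0,1\}$ as
\begin{align*}
	\mathbb{E}[Y(t)\mid X=x] &= \mathbb{E}[Y(t)\mid X=x, T=t] && \text{(strong ignorability)}\\
	&= \mathbb{E}[Y\mid X=x, T=t] && \text{(consistency)},
\end{align*}
where positivity ensures that both conditional expectations are well-defined for all $x$. Taking the expectation over $X$ then gives
\begin{equation*}
	V(\pi) = \mathbb{E}\big[\pi(X)\,\mathbb{E}[Y\mid X, T=1] + (1-\pi(X))\,\mathbb{E}[Y\mid X, T=0]\big],
\end{equation*}
which involves only the observed data $(X, T, Y)$.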

\subsection{Problem Formulation}\label{sec::problem_formulation}
Suppose we are interested in learning a policy that minimizes the policy value under the target distribution $(X, T, Y) \sim \mathbb{P}$, \ie, $V_{\text{Target}}(\pi)$. However, we are \emph{not} given data from the target distribution, but only data from a (potentially different) training distribution $(X, T, Y) \sim\mathbb{P}_{\text{Train}}$.

Standard policy learning methods assume that the training and target distributions are identical. However, even in carefully designed clinical trials, the subjects in the trial are often not representative of the target population, \ie, $\mathbb{P}_{\text{Train}} \neq \mathbb{P}$ \citep[\eg,][]{buchanan2018generalizing, cortes2008sample, downs1998feasibility,flores2021assessment, gandhi2005eligibility, greenblatt2011priority, rothwell2005external}. Hence, standard methods for policy learning yield policies that minimize the policy value on the training data, \ie, $V_{\text{Train}}(\pi)= \mathbb{E}_{\text{Train}}[Y^\pi]$. However, since the policy value depends on the underlying data distribution, these policies may \emph{not} minimize the policy value on the target population, \ie, $V_{\text{Target}}(\pi) = \mathbb{E}[Y^\pi]$. This can be seen when writing $\mathbb{E}[Y^\pi]$ in terms of the distribution $\mathbb{P}_{\text{Train}}$ using a change of probability measure:
\begin{equation}\label{eq:change_measure}
	\mathbb{E}[Y^\pi] = \mathbb{E}_{\text{Train}}[R\,Y^\pi],
\end{equation}
where the random variable $R = \text{d}\mathbb{P}/\text{d}\mathbb{P}_{\text{Train}}$ is the \emph{Radon-Nikod\'{y}m derivative},\footnote{The standard assumption that $\mathbb{P}$ is absolutely continuous with respect to $\mathbb{P}_{\text{Train}}$, \ie, $\mathbb{P} \ll \mathbb{P}_{\text{Train}}$, is made in order to ensure that the Radon-Nikod\'{y}m derivative is well-defined.} also known as the \emph{density ratio}. As a direct implication, if $\mathbb{P}_{\text{Train}} \neq \mathbb{P}$ and, thus, $R\neq 1$,
%and therefore $p(X, Y, T)/q(X, Y, T) \neq 1$
it follows that, in general,
\begin{equation}
	\mathbb{E}[Y^\pi] \neq	\mathbb{E}_{\text{Train}}[Y^\pi].
\end{equation}
In other words, a policy learned from training data using standard methods may not generalize to the target population, and, as such, may be of little help in practice. 
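
To make this gap concrete, consider the following toy example, in which all numbers are hypothetical and chosen purely for illustration. Let $X\in\{0,1\}$ be a single binary covariate with $\mathbb{P}(X=1)=0.5$ in the target population and $\mathbb{P}_{\text{Train}}(X=1)=0.2$ in the training data, and assume that the two distributions differ only in the marginal distribution of $X$, so that $R$ depends on $X$ alone. Then
\begin{align*}
	R(1) &= 0.5/0.2 = 2.5, \qquad R(0) = 0.5/0.8 = 0.625,\\
	\mathbb{E}_{\text{Train}}[R] &= 0.2\cdot 2.5 + 0.8\cdot 0.625 = 1.
\end{align*}
Suppose further that a fixed policy $\pi$ satisfies $\mathbb{E}[Y^\pi\mid X=1]=1$ and $\mathbb{E}[Y^\pi\mid X=0]=0$ under both distributions. Then $\mathbb{E}_{\text{Train}}[Y^\pi] = 0.2$, whereas $\mathbb{E}[Y^\pi] = 0.5$, and re-weighting as in \labelcref{eq:change_measure} recovers the target value:
\begin{equation*}
	\mathbb{E}_{\text{Train}}[R\,Y^\pi] = 0.2\cdot 2.5\cdot 1 + 0.8\cdot 0.625\cdot 0 = 0.5.
\end{equation*}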

%In this paper, ...
In this paper, we consider the realistic setting in which the training data is \emph{not} representative of the target population. We propose a framework for learning policies that generalize to the target population only given data from the training distribution, \ie, \mbox{$\{(X_i, T_i, Y_i)\}_{i=1}^n \sim \mathbb{P}_{\text{Train}}$}.

\subsection{Related Work}\label{sec::related_work}
Despite the vast literature on off-policy learning, little work considers the problem of learning policies that generalize to the target population. Below, we summarize works on off-policy learning and works on external validity in causality, which is closely related to generalizability.

\textbf{Off-policy learning.} Off-policy learning methods can be broadly divided into three categories: (i)~Direct methods estimate the outcome functions \mbox{$\mu_t(x) = \mathbb{E}[Y(t)\mid X=x]$} and plug them into \labelcref{eq:policy_value}, \ie, \mbox{$\hat{V}^{\mathrm{DM}}(\pi) = \frac1n \sm i n [\pi(X_i)\hat{\mu}_1(X_i) + (1-\pi(X_i))\hat{\mu}_0(X_i)]$} \citep[\eg,][]{bennett2020efficient}. This approach is closely related to estimating the conditional average treatment effect, \ie, $\mathbb{E}[Y(1)-Y(0)\mid X]$ \citep{Shalit2017a, hatt2022combining}. 
Direct methods are known to be sensitive to misspecification of the model for $\mu_t(x)$. (ii)~Weighting methods re-weight the outcome data such that it looks as if it were generated by the policy that is evaluated \citep[\eg,][]{bottou2013counterfactual, horvitz1952generalization, kallus2018balanced,li2011unbiased}. A common choice of weights is the normalized inverse propensity weights \citep{swaminathan2015self}, \ie, $\hat{V}^{\mathrm{NIPW}}(\pi) = \frac1n \sm i n 2W_i^{\mathrm{IPW}}(1-2T_i)(1-T_i-\pi(X_i))Y_i/(\frac1n\sm j n W_j^{\mathrm{IPW}})$, where $W_i^{\mathrm{IPW}} = 1/((1-2T_i)(1-T_i-\pi^b(X_i)))$ and $\pi^b(x) = \Prb{T=1\mid X=x}$ is the so-called behavior policy, which was used to generate the training data. (iii)~Doubly robust methods combine direct and weighting methods, typically using the augmented inverse propensity weight estimator \citep{athey2017efficient, dudik2014doubly, thomas2016data}. When the direct estimate $\hat{\mu}_t$ is biased, the doubly robust method weights the residuals by the inverse propensity weights in order to remove the bias, \ie, $\hat{V}^{\mathrm{DR}}(\pi) = \hat{V}^{\mathrm{DM}}(\pi) + \frac1n\sm i n W_i^{\mathrm{IPW}}(1-2T_i)(1-T_i-\pi(X_i))(Y_i - \hat{\mu}_{T_i}(X_i))$. %As a result, either the direct method or the weighting method need to be consistent for the doubly robust estimator to be consistent.
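
The compact notation in the weighting estimators can be unpacked by a case distinction on the observed treatment $T_i$; the following is only a sanity check of the formulas as stated above:
\begin{align*}
	(1-2T_i)(1-T_i-\pi(X_i)) &=
	\begin{cases}
		\pi(X_i), & T_i = 1,\\
		1-\pi(X_i), & T_i = 0,
	\end{cases}\\
	W_i^{\mathrm{IPW}} &=
	\begin{cases}
		1/\pi^b(X_i), & T_i = 1,\\
		1/(1-\pi^b(X_i)), & T_i = 0.
	\end{cases}
\end{align*}
Hence, the first expression is the probability that $\pi$ assigns to the observed treatment, and $W_i^{\mathrm{IPW}}$ is the inverse of the probability that the behavior policy assigns to it. Moreover, under the training distribution,
\begin{equation*}
	\mathbb{E}[W^{\mathrm{IPW}}\mid X] = \frac{\pi^b(X)}{\pi^b(X)} + \frac{1-\pi^b(X)}{1-\pi^b(X)} = 2,
\end{equation*}
so that the normalizing term $\frac1n\sum_{j=1}^n W_j^{\mathrm{IPW}}$ concentrates around $2$, which is consistent with the factor $2$ in the numerator of $\hat{V}^{\mathrm{NIPW}}(\pi)$.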

The above methods have become the standard for off-policy learning. Despite their widespread use, the above methods implicitly assume that the training data, which is used to learn the policy, is representative of the target population. As such, when the training data is \emph{not} representative of the target population, we cannot rely on the above methods, as policies may not generalize to the target population. 

\textbf{Distributionally robust optimization.} A related, yet fundamentally different idea is distributionally robust optimization (DRO) \citep[\eg,][]{duchi2018learning}, which studies robustness towards distributional shifts. DRO has found application in off-policy learning by optimizing worst-case policy values \citep{si2020distributionally} and individualized treatment rules \citep{zhao2019robustifying, mo2020learning}. While generalizability is related to DRO, since the difference between training and target distribution can be seen as a distributional shift, it is fundamentally different, as DRO allows for arbitrary changes in distribution. In contrast, generalizability considers a training distribution that is, potentially, not representative of the target population, but derived from the target population. That is, generalizability considers differences in the distributions that arise from an unknown selection mechanism into the training data. Moreover, DRO and its applications require the decision-maker to quantify the distance between training and target distribution in terms of some divergence measure (typically the Kullback-Leibler divergence), which may be notoriously difficult for domain experts such as clinicians. In contrast, our approach allows for user-friendly and intuitive calibration of the parameters involved in the uncertainty set by recognizing that the differences arise from an unknown selection mechanism.


%Distributional robustness + safe policy learning
%A related, yet fundamentally different idea is distributionally robust optimization (DRO) \citep[\eg,][]{duchi2018learning}, which studies robustness towards distributional shifts. DRO has found application in off-policy learning by optimizing worst-case policy values \citep{si2020distributionally} and individualized treatment rules \citep{zhao2019robustifying, mo2020learning}. While generalizability is related to DRO, since the difference between training and target distribution can be seen as a distributional shift, it is fundamentally different, as DRO allows for arbitrary changes in distribution. In contrast, generalizability considers a training distribution derived from the target population that is, potentially, not representative of the target population. Moreover, DRO requires the decision-maker to guess the difference of training and target distribution in terms of some divergence measure (typically the Kullback-Leibler divergence), which may be notoriously difficult for domain experts such as clinicians. %(ii) Safe policy learning seeks policies for which it can be guaranteed that they can be deployed in the target population \citep{thomas2015high, laroche2019safe, ghavamzadeh2016safe}. Our work is of great benefit for the safe deployment of policies, since we learn policies that can generalize well on the target population and, as such, we be safely deployed.

%Learning policies such that they are robust to distributional shifts can be also achieved by maximize the so-called worst-case quality of a policy \citep{zhao2019robustifying}. Instead of searching the worst-case quality among within a Kullback-Leibler ball around the training distribution, the authors assume that the testing distribution satisfies some moment conditions and search for a worst-case quality among all possible testing distributions.


\textbf{External validity in causality.} In contrast to policy learning, causal inference aims to estimate causal effects from observational data \citep{bottou2013counterfactual,kuzmanovic2021deconfounding, hatt2021sequential}. External validity in causal inference is concerned with whether causal effect estimates obtained from a study sample are also valid for the target population. A common approach to address this is to re-weight the data with the inverse of a subject's probability of being selected into the study sample \citep[\eg,][]{buchanan2018generalizing,cole2010generalizing, dahabreh2019generalizing, imai2013estimating, stuart2011use}. This idea has been extended to a doubly robust method for off-policy learning \citep{kato2020off}. Predominantly used in economics, the Heckman correction is another technique that is also based on a subject's selection probability \citep{heckman1979sample}. However, in order to estimate these selection probabilities, all existing approaches assume that data from both the study sample \emph{and} the target population is given. In practice, however, we are only given data from the study sample and not from the target population. Other approaches include approximations of the bias arising from the difference in the study sample and target population by using weights that do not depend on the selection variable \citep{andrews2017weighting}, by bounding the weights directly \citep{aronow2013interval}, or, in addition, by constraining the shape of the population outcome distribution \citep{miratrix2018shape}.

In contrast to the above approaches, and closer to practice, we do not assume access to samples from the target population and, hence, cannot estimate a subject's selection probability. As a remedy, we present our framework for learning generalizable policies in the following.%Moreover, different to the above approaches, we do not only focus on shifts in the distribution of covariates, but allow for any change in distribution.



%However, this approach requires that both samples from the training distribution and samples from the testing distribution are available. This is rarely the case in practice, since it remains challenging to collect data that is representative for the target population. This, however, requires a prior knowledge of the selection probabilities. A prior knowledge of the selection probabilities is only provided in prespecified target populations, which, in practice, is rarely the case.


%Specifically, due to the inclusion and exclusion criteria of an experimental sample, the experimental sample can be unrepresentative of the target population we are interested in. Therefore, the corresponding casual evidence may not be broadly applicable or relevant for the real-world practice.


%%%%%%%%%%%%%OLD Standard policy related work%%%%%%%%%%%%%%%%%%%%%
%\textbf{Off-policy learning.}   
%%Standard policy learning
%Most approaches to off-policy learning can be broadly divided into three approaches: (i) Direct methods estimate the relationship between outcomes and covariates and treatment, and generate a plug-in estimator \citep{qian2011performance, }. This approach is closely related to estimating the conditional average treatment effect \citep[\eg,][]{}. (ii) Weighting-based methods re-weight the outcome data such that the re-weighted data looks like as if it were generated by the policy that is evaluated \citep{kallus2018balanced}. A common choice of such weights are inverse propensity weights \citep{beygelzimer2009offset, li2011unbiased, horvitz1952generalization, kallus2018confounding, bottou2013counterfactual}. (iii) Doubly robust methods combine direct and weighting-based methods such that either of the two needs to be consistent for the doubly robust estimator to be consistent \citep{dudik2014doubly, athey2017efficient, thomas2016data}. The above methods have proven to be effective and, as a consequence, have become the standard for off-policy learning. Despite their the widespread uses, all of the above methods implicitly assume that the observed data which is used to learn the policy is representative for target population. As such, when the observed data is \emph{not} representative for target population, we cannot rely on the above methods, as policies learned may not generalize to the target population.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Generalizing Off-Policy Learning under Sample Selection Bias}
In this section, we introduce our framework for learning policies that generalize to the target population. For this, we first characterize the difference between the training distribution $\mathbb{P}_{\text{Train}}$ and the target distribution $\mathbb{P}$ as a sample selection bias (\Cref{sec::sample_selection_bias}). Then, based on this, we derive an uncertainty set and optimize the minimax policy value to achieve the best worst-case policy value (\Cref{sec::generalizing_policy_learning}). We prove that policies learned in this way do not do worse on the target population than the worst-case policy value and, as such, can be generalized to the target population (\Cref{sec::generalization_guarantee}).

\subsection{Sample Selection Bias}\label{sec::sample_selection_bias}
In this section, we characterize the difference between the training distribution $\mathbb{P}_{\text{Train}}$ and the target distribution $\mathbb{P}$ as a sample selection bias using a selection variable \citep[\eg,][]{cortes2008sample, manski1989anatomy}. This then allows us to characterize the Radon-Nikod\'{y}m derivative $R = \text{d}\mathbb{P}/\text{d}\mathbb{P}_{\text{Train}}$ in \labelcref{eq:change_measure} in terms of the selection variable.

%\TODO{Do we need this?} To this end, we recognize that the training data is usually not arbitrarily different form the target population, but rather a biased sample of the target population arising from the inclusion and exclusion of certain individuals. For instance, \citet{gandhi2005eligibility} highlighted the limited eligibility and, hence, participation of women in HIV clinical trial. As a consequence, the subjects in such clinical trials are not representative of the target population.

We represent the selection bias with a selection variable $S\in\{0,1\}$. If, for a subject, $S=1$, the subject is included in the training data, and, if $S=0$, the subject is excluded from the training data. As a result, we can write the training distribution in terms of the target distribution:
\begin{equation}\label{eq:connection_S}
	\mathbb{P}_{\text{Train}}(\cdot) = \mathbb{P}(\cdot\mid S=1).
\end{equation}
Based on this, we characterize the Radon-Nikod\'{y}m derivative, which enables us to write the policy value on the target population in terms of the selection variable $S$ and the training distribution $\mathbb{P}_{\text{Train}}$.
\begin{proposition}\label{lemma_selection_ratio}
	Under the sample selection bias, we can write the Radon-Nikod\'{y}m derivative $R = \textup{d}\mathbb{P}/\textup{d}\mathbb{P}_{\textup{Train}}$ as
	\begin{equation}
		R = \frac{\Prb{S=1}}{\mathbb P(S=1\mid X,T,Y)},
	\end{equation}
	and, therefore, we can write the policy value on the target population as
	\begin{equation}
		V_{\text{Target}}(\pi) = \mathbb{E}_{\textup{Train}}\left[\frac{\Prb{S=1}}{\mathbb P(S=1\mid X,T,Y)}Y^\pi\right].
	\end{equation}
\end{proposition}
See Appendix~A.1 for a proof. If, hypothetically, we observed $S$, we could estimate $R=\Prb{S=1}/\mathbb P(S=1\mid X,T,Y)$\footnote{Under the standard assumption \mbox{$\mathbb{P} \ll \mathbb{P}_{\text{Train}}$}, the selection variable satisfies positivity, \ie, \mbox{$\mathbb{P}(S=1\mid X,T,Y) > 0$}. Therefore, the ratio $\Prb{S=1}/\mathbb P(S=1\mid X,T,Y)$ is well-defined.} and re-weight the data accordingly in order to obtain the policy value on the target population. However, we never observe the selection variable $S$, since we only observe the training data for which $S=1$. This renders the selection variable $S$ unidentifiable from the training data. Instead, we use an uncertainty set over which we optimize the minimax policy value on the target population.
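
\textbf{Sketch of the ratio:} The form of $R$ can be motivated by a short calculation. The proof is in Appendix~A.1; the density notation used here is a simplifying assumption made only for this illustration. Suppose $\mathbb{P}$ has a density $p(x,t,y)$ and write $p_{\textup{Train}}$ for the density of $\mathbb{P}_{\textup{Train}}$. Combining \labelcref{eq:connection_S} with Bayes' rule gives
\begin{equation*}
	p_{\textup{Train}}(x,t,y) = p(x,t,y\mid S=1) = \frac{\mathbb{P}(S=1\mid x,t,y)\,p(x,t,y)}{\mathbb{P}(S=1)},
\end{equation*}
so that $R = p/p_{\textup{Train}} = \mathbb{P}(S=1)/\mathbb{P}(S=1\mid X,T,Y)$. Subjects that are unlikely to be selected thus receive a large weight, and integrating $R$ against $p_{\textup{Train}}$ gives $\mathbb{E}_{\textup{Train}}[R] = 1$.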

\subsection{Learning Generalizable Policies by Optimizing Minimax Policy Value}
\label{sec::generalizing_policy_learning}
We derive an uncertainty set around $R = \Prb{S=1}/\mathbb P(S=1\mid X,T,Y)$ over which we maximize the policy value to obtain the worst-case policy value. Then, our framework optimizes the minimax policy value to achieve the best worst-case policy value on the target population.

%To this end, we seek the policy that minimizes the worst-case policy value on the target population. The worst-case policy value is achieved by maximizing the policy value over an uncertainty set around the density ratio $R = R(X,T,Y)$. This guarantees that the policy learned on the training data does not do worse on the target population and, as such, allows for generalizablity of the learned policy.

If we had oracle access to the true Radon-Nikod\'{y}m derivative $R^\ast_i = R^\ast(X_i,T_i,Y_i)$, we could estimate the policy value on the target population using \Cref{lemma_selection_ratio}, that is, by re-weighting the data with $R^\ast$. This often leads to high-variance estimates when selection probabilities are close to zero. As a remedy, since $\mathbb{E}_{\textup{Train}}[R^\ast] = 1$, we use the empirical sum of the true Radon-Nikod\'{y}m derivatives as a control variate to normalize the estimate. This gives rise to the following H\'{a}jek estimator for the policy value on the target population $V_{\text{Target}}(\pi)$:
\begin{equation}\label{eq:policy_value_estimator}
	\hat{V}_{\text{Target}}^\ast(\pi) = \frac{\sm i n R^\ast_i\psi_i(\pi)}{\sm i n R^\ast_i},
\end{equation}
where $\psi_i(\pi)$ corresponds to one of the three standard methods for policy learning: direct, weighting, and doubly robust methods. Formally, $\psi_i(\pi)$ is either  $\psi_i^{\mathrm{DM}}(\pi)$, $\psi_i^{\mathrm{NIPW}}(\pi)$, or $\psi_i^{\mathrm{DR}}(\pi)$ given as:
\begin{align}
	&\psi_i^{\mathrm{DM}}(\pi) = \pi(X_i)\mu_1(X_i) + (1-\pi(X_i))\mu_0(X_i)\label{psi_dm},\\
	&\psi_i^{\mathrm{NIPW}}(\pi) = \frac{2W_i^{\mathrm{IPW}}(1-2T_i)}{\frac1n\sm j n W_j^{\mathrm{IPW}}}(1-T_i-\pi(X_i))Y_i \label{psi_ipw},\\
	&\psi_i^{\mathrm{DR}}(\pi) = \psi_i^{\mathrm{DM}}(\pi)+ \nonumber\\
	&W_i^{\mathrm{IPW}}(1-2T_i)(1-T_i-\pi(X_i))(Y_i - \mu_{T_i}(X_i))\label{psi_dr}.
\end{align}
The outcome functions $\mu_t(x)$ and the weights $W^{\mathrm{IPW}}$ need to be estimated from data. Any $\psi(\pi)$ in \labelcref{psi_dm}, \labelcref{psi_ipw}, or \labelcref{psi_dr} can be chosen for estimating the policy value as long as the data is re-weighted with the Radon-Nikod\'{y}m derivative $R^\ast$.
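
\textbf{Reading the weights:} The factor $(1-2T_i)(1-T_i-\pi(X_i))$ in \labelcref{psi_ipw} and \labelcref{psi_dr} picks out the weight that the policy places on the treatment that was actually observed. Evaluating both cases gives
\begin{equation*}
	(1-2T_i)(1-T_i-\pi(X_i)) = \begin{cases}\pi(X_i), & \text{if } T_i=1,\\ 1-\pi(X_i), & \text{if } T_i=0,\end{cases}
\end{equation*}
which matches the way $\pi(X_i)$ and $1-\pi(X_i)$ weight $\mu_1$ and $\mu_0$ in \labelcref{psi_dm}. An observation therefore contributes to the weighting and correction terms in proportion to how strongly the policy agrees with the treatment received.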

Since the true $R^\ast$ is unknown, we instead derive a worst-case policy value on the target population. This allows us to ensure that our policy does not do worse than expected once it is implemented in the target population. For this, we maximize \labelcref{eq:policy_value_estimator} over an uncertainty set around $R^\ast$. We consider an uncertainty set motivated by sensitivity analysis in causality \citep[\eg,][]{kallus2018interval, kallus2018confounding,rosenbaum2002overt, zhao2019sensitivity}, which restricts by how much $\mathbb{P}(S=1 \mid X,T,Y)$ can vary from $\mathbb{P}(S=1)$ via the odds-ratio characterization as follows:
\begin{equation}\label{eq:odd_ratio}
	\frac{1}{\Gamma} \leq \frac{\Prb{S=1}(1-\mathbb P(S=1\mid X,T,Y))}{\mathbb P(S=1\mid X,T,Y)(1-\Prb{S=1})}\leq \Gamma,
\end{equation}
where $\Gamma\geq 1$.
%The intuition of the odds-ratio is the following. If we take logarithms in \Cref{eq:odd_ratio}, then bounding the odds-ratio can be seen as bounding the absolute difference between the logits of $\mathbb{P}(S=1)$ and $\mathbb{P}(S=1 \mid X,T,Y)$ by $\text{log}(\Gamma)$.
For $\Gamma=1$, we have equal probability of selection, \ie, $\mathbb{P}(S=1 \mid X,T,Y) = \mathbb{P}(S=1)$ and, thus, no difference between the training data and the target population. Larger values of $\Gamma$ allow for larger variation in the probabilities of selection. The bounded odds-ratio in \labelcref{eq:odd_ratio} immediately yields an uncertainty set for the Radon-Nikod\'{y}m derivative:
\begin{align}
	\cl{R} = &\{R\in\mathbb{R}_{+}^n: l \leq R_i \leq u,\, \forall i\},\\
	%&\text{where}\nonumber\\
	&\text{where } \, l = \frac{1 - \Prb{S=1} + \Gamma \Prb{S=1}}{\Gamma},\\
	&\phantom{\text{where }} \, u = \Gamma(1 - \Prb{S=1}) + \Prb{S=1}.
\end{align}
The uncertainty set $\cl{R}$ includes all Radon-Nikod\'{y}m derivatives $R$ that satisfy the odds-ratio restriction in \labelcref{eq:odd_ratio}. For a given policy, we seek the maximum policy value on the target population among all Radon-Nikod\'{y}m derivatives in the uncertainty set. This yields the following worst-case policy value.
\begin{definition}(Worst-case policy value.)
	The worst-case policy value on the target population under the bounded odds-ratio with parameter $\Gamma$ is given by
	\begin{equation}\label{eq:worst_case_plicy_value}
		\overline{V}_{\text{Target}}(\pi; \cl{R}) = \max_{R\in \cl R} \frac{\sm i n R_i\psi_i(\pi)}{\sm i n R_i},
	\end{equation}
	where $\psi_i(\pi)$ corresponds to either \labelcref{psi_dm}, \labelcref{psi_ipw}, or \labelcref{psi_dr}.
\end{definition}
Then, we seek the optimal policy in a policy class $\Pi$, which minimizes the worst-case policy value on the target population, \ie, 
\begin{equation}\label{eq:policy_optimzation}
	\overline{\pi}(\Pi, \cl{R}) \in \argmin_{\pi \in \Pi} \overline{V}_{\text{Target}}(\pi; \cl{R}).
\end{equation}
In particular, a policy learned with our framework generalizes to the target population, since it does not do worse on the target population than the worst-case policy value estimated using the training data. For this, a decision-maker only has to quantify the population selection probability, \ie, $\mathbb{P}(S=1)$ and appropriately choose the maximum deviation from it via $\Gamma$. We discuss data-driven approaches to choose these quantities in \Cref{sec:calibration}. We derive a tailored convex-concave procedure for optimizing \labelcref{eq:policy_optimzation} in \Cref{sec:optimizing_policies}.

\subsection{Theoretical Guarantees for Generalizability}\label{sec::generalization_guarantee}
We prove that, if the Radon-Nikod\'{y}m derivative is appropriately bounded, the worst-case policy value, $\overline{V}_{\text{Target}}(\pi; \cl{R})$, is asymptotically an upper bound for the true policy value on the target population, $V_{\text{Target}}(\pi)$. As such, a policy learned with our framework does not do worse on the target population than the worst-case policy value. Similar to \citet{athey2017efficient}, we express the flexibility of a policy class $\Pi$ using the notion of the Rademacher complexity, \ie, $\cl R_n(\Pi)$.\footnote{The empirical Rademacher complexity of a policy class $\Pi$ is defined as \mbox{$\cl R_n(\Pi) = \frac{1}{2^n}\sum_{\sigma\in\{-1, +1\}^n}\sup_{\pi\in\Pi}\lvert\frac1n \sm i n \sigma_i\pi(X_i)\rvert$}.}%\footnote{We express the flexibility of $\Pi$ using the notion of the VC major dimension, which is defined as follows \citep{dudley2010universal}:
%	Given a grounded set $\cl G$ and a set of maps $\cl F \subset [\cl G\rightarrow\mathbb{R}]$, the VC-major dimension of $\cl F$ is the largest number $v\in\mathbb{N}$ such that there exists $g_1, \ldots, g_v\in\cl G$ with 
%	\begin{equation}
%		\{(\mathds{1}(f(g_1)>c), \ldots, \mathds{1}(f(g_v)>c)):\, f\in\cl F, c\in \mathbb R\} = \{0, 1\}^v.
%	\end{equation}
%}
\begin{theorem}(Generalization bound.)\label{thm:gen_bound}
	Suppose the true Radon-Nikod\'{y}m derivative is appropriately bounded, \ie, $R^\ast\in\cl R$ and, therefore, $l\leq R_i^\ast\leq u$, and we have bounded outcomes, \ie, $\vert Y \vert < C$. Then, for a constant $K_\psi$ depending on $\psi(\pi)$ and for some $\delta>0$, we have that,
	\begin{equation}\label{eq:lower_bound}
		V_{\text{Target}}(\pi) \leq \overline{V}_{\text{Target}}(\pi; \cl{R}) + 2C\text{\scriptsize$\frac{u}{l}$}K_\psi\Big(\cl{R}_n(\Pi) + \sqrt{\text{\scriptsize$\frac{18\,\textup{log}(4/\delta)}{n}$}}\Big),
	\end{equation}
	with probability at least $1-\delta$ and for any $\pi\in\Pi$.\qed
\end{theorem}
See Appendix~A.2 for a proof. All policy classes we consider have $\sqrt{n}$-vanishing Rademacher complexity, \ie, $\cl{R}_n(\Pi) = \mathcal{O}(n^{-1/2})$. Therefore, \Cref{thm:gen_bound} proves that, asymptotically, $\overline{V}_{\text{Target}}(\pi; \cl{R})$ is an upper bound for $V_{\text{Target}}(\pi)$. This guarantees that $\overline{\pi}(\Pi, \cl R)$ from \labelcref{eq:policy_optimzation} does not do worse on the target population than the worst-case policy value, which is calculated using training data. In particular, since $\overline{\pi}(\Pi, \cl R)$ minimizes the right-hand side of \labelcref{eq:lower_bound}, $\overline{\pi}(\Pi, \cl R)$ is the best policy that is guaranteed to generalize to the target population. Our bound in \Cref{thm:gen_bound} holds without complete knowledge of the selection variable and proves that our framework yields policies that generalize to the target population.
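
\textbf{Magnitude of the slack:} To give a sense of scale for the concentration term in \labelcref{eq:lower_bound}, consider the illustrative values $n = 1000$ and $\delta = 0.05$ (chosen here for illustration, not taken from the experiments). Reading $\log$ as the natural logarithm, which is an assumption on our part, we obtain
\begin{equation*}
	\sqrt{\frac{18\log(4/\delta)}{n}} = \sqrt{\frac{18\log(80)}{1000}} \approx 0.28.
\end{equation*}
The term shrinks at rate $n^{-1/2}$, so quadrupling $n$ halves it. The full slack is additionally scaled by $2C\frac{u}{l}K_\psi$, so a wider uncertainty set (larger $u/l$) loosens the bound.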

Note that in \Cref{thm:gen_bound}, we use the true nuisance functions instead of estimates, since it has been shown that this does not affect the leading term in the convergence rate of the policy value \citep[see][Sec.~3.1, Sec.~3.2, and Lemma~4]{athey2017efficient}. This holds true if the nuisance functions have finite second moments and are estimated with consistent estimators whose $L^2$ errors decay at rate $1/n^\zeta$, where $\zeta$ depends on the nuisance functions. Hence, to provide a generalization bound on the policy value, it is enough to consider the true nuisance functions as we did in \Cref{thm:gen_bound}.


%\subsection{Calibration of $\mathbf{\Gamma}$ and $\mathbf{\Prb{S=1}}$}\label{sec:calibration_params}
%In this section, we discuss two approaches to calibrate the parameters $\Gamma$ and $\Prb{S=1}$ in \labelcref{eq:odd_ratio}, which are context-dependent: (i)~Practitioner calibration with domain knowledge and (ii)~data-driven calibration.

%\textbf{(i)~Practitioner calibration:} This approach is based on domain knowledge of a practitioner. Practitioners generally have domain knowledge on the variables that impact selection into training data. First, $\Prb{S=1}$, the population probability of inclusion, needs to be quantified. If the study is randomized, a value $\approx 1/2$ is reasonable. Second, $\Gamma$, the largest deviation from $\Prb{S=1}$, needs to be quantified. Practitioners may choose larger values of $\Gamma$ in case of high uncertainty. Our framework allows a practitioner-friendly choice of calibration parameters. In fact, both questions may be simply answered using domain knowledge.
%
%\textbf{(ii)~Data-driven calibration:} Although our framework enables practitioners to choose appropriate calibration parameters, we provide a fully data-driven approach for calibrating $\Gamma$ and $\Prb{S=1}$. To this end, we consider a setting in which samples from \emph{one} of the covariates of the target population are provided. This is reasonable, since we often have limited understanding of the target population and, for instance, know covariates such as the distribution of gender or age in the target population. Once we are given one covariate, \eg, $x_{\text{age}}$, we proceed in two steps: (1)~For calibrating $\Prb{S=1}$, we approximate $\Prb{S=1\mid X, Y, T}$ via an estimate of $\Prb{S=1\mid x_{\text{age}}}$ and, based on this, we approximate $\Prb{S=1}$ by averaging over $x_{\text{age}}$, \ie, $\frac1n\sm i n \Prb{S_i=1\mid x_{\text{age}, i}}$. (2)~For calibrating $\Gamma$, we take the maximum of the odds-ratio in \labelcref{eq:odd_ratio} with the above estimates for $\Prb{S=1}$ and $\Prb{S=1\mid x_{\text{age}}}$ plugged in, which yields a value for $\Gamma$. We use this data-driven calibration procedure in our experiments (\Cref{sec:experiments}).
%
%In case the uncertainty regarding the calibration parameters remains high, large values for $\Gamma$ can be chosen, yielding a wide uncertainty set and conservative policies.

\subsection{Calibration of $\mathbf{\Gamma}$ and $\mathbf{\Prb{S=1}}$}\label{sec:calibration}
In this section, we discuss two approaches to calibrate the parameters $\Gamma$ and $\Prb{S=1}$ in \labelcref{eq:odd_ratio}, which are context-dependent: (i)~Practitioner calibration with domain knowledge and (ii)~data-driven calibration.

\textbf{(i)~Practitioner calibration:} This approach is based on domain knowledge of practitioners about variables that impact selection into training data. First, $\Prb{S=1}$, the population probability of inclusion, needs to be quantified. If the study is randomized, a value $\approx 1/2$ is reasonable. Second, $\Gamma$, the largest deviation from $\Prb{S=1}$, needs to be quantified. Both quantities can be specified directly from domain knowledge, which makes the calibration practitioner-friendly.

\textbf{(ii)~Data-driven calibration:} Although our framework enables practitioners to choose appropriate calibration parameters, we also provide a fully data-driven approach for calibrating $\Gamma$ and $\Prb{S=1}$. To this end, we consider a setting in which samples from \emph{one} of the covariates of the target population are provided. This is reasonable, since we often have limited understanding of the target population but know, for instance, the distribution of covariates such as gender or age in the target population. Once we are given one covariate, \eg, $x_{\text{age}}$, we proceed in two steps: (1)~For calibrating $\Prb{S=1}$, we approximate $\Prb{S=1\mid X, T, Y}$ via an estimate of $\Prb{S=1\mid x_{\text{age}}}$ and, based on this, we approximate $\Prb{S=1}$ by averaging over $x_{\text{age}}$, \ie, $\frac1n\sm i n \Prb{S_i=1\mid x_{\text{age}, i}}$. (2)~For calibrating $\Gamma$, we take the maximum of the odds-ratio in \labelcref{eq:odd_ratio} with the above estimates for $\Prb{S=1}$ and $\Prb{S=1\mid x_{\text{age}}}$ plugged in, which yields a value for $\Gamma$. We use this data-driven calibration procedure in our experiments (\Cref{sec:experiments}).
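
\textbf{Calibration in formulas:} The two steps can be summarized as follows. Here, $\hat{e}(x_{\text{age}})$, $\hat{p}$, $\hat{\Gamma}$, and $\mathrm{OR}_i$ are shorthand introduced only for this summary, with $\hat{e}(x_{\text{age}})$ denoting an estimate of $\mathbb{P}(S=1\mid x_{\text{age}})$. Taking the maximum over both the odds ratio and its inverse is one natural reading of ``the maximum of the odds-ratio'' under the two-sided bound in \labelcref{eq:odd_ratio}; it is an assumption and not a specification of the implementation:
\begin{align*}
	\hat{p} &= \frac1n \sum_{i=1}^n \hat{e}(x_{\text{age},i}),\\
	\hat{\Gamma} &= \max_{i} \max\Big\{\mathrm{OR}_i, \frac{1}{\mathrm{OR}_i}\Big\}, \quad \mathrm{OR}_i = \frac{\hat{p}\,(1-\hat{e}(x_{\text{age},i}))}{\hat{e}(x_{\text{age},i})\,(1-\hat{p})}.
\end{align*}
Under this reading, $\hat{\Gamma}\geq 1$ holds by construction, and $\hat{\Gamma}=1$ if and only if the estimated selection probability does not vary with $x_{\text{age}}$.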

In case the uncertainty regarding the calibration parameters remains high, large values for $\Gamma$ can be chosen, yielding a wide uncertainty set and conservative policies.


\section{Optimizing Generalizable Policies}\label{sec:optimizing_policies}
In this section, we derive an efficient algorithm for optimizing the minimax policy value in \labelcref{eq:policy_optimzation}. For this, we consider a parameterized policy class $\Pi=\{\pi(\cdot,\theta):\theta\in\Theta\}$ and the minimax problem
\begin{equation}\label{eq:MMP}
	\min\limits_{\theta \in \Theta} \max\limits_{R\in \cl R} \frac{\sm i n R_i\psi_i(\theta)}{\sm i n R_i},\tag{MMP}
\end{equation}
where $\psi_i(\theta)$ denotes $\psi_i(\pi(\cdot,\theta))$ and corresponds to either \labelcref{psi_dm}, \labelcref{psi_ipw}, or \labelcref{psi_dr}. The above minimax problem is non-trivial, since it is in general non-convex in $\theta$. We first derive a closed-form solution of the worst-case policy value subproblem (\Cref{sec:closed_form_solution_worst_case}). Then, based on this, we develop a tailored convex-concave procedure that solves (\ref{eq:MMP}) (\Cref{sec:MMCCP}).
%Since the uncertainty set $\cl{R}$ is a compact subset of $\mathbb{R}^n$,

\subsection{Closed-Form Solution of Worst-Case Policy Value}\label{sec:closed_form_solution_worst_case}
The solution of (\ref{eq:MMP}) involves the worst-case policy value subproblem in \labelcref{eq:worst_case_plicy_value}.
%\begin{equation}\label{eq:optimzation_problem}
%\overline{V}_{\text{Target}}(\pi; \cl{R}) = \max_{R\in \cl{R}} \frac{\sm i n R_i\psi_i(\theta)}{\sm i n R_i}.
%\end{equation}
We derive a closed-form solution of the subproblem and the corresponding Radon-Nikod\'{y}m derivative at the optimal solution in \Cref{th:closed_form_lower_bound}.
%Since $\cl{S}_n^\Gamma$ involves only linear constraints on $\tilde{r}$ in \labelcref{eq:optimzation_problem} is a linear fractional program. We can reformulate it as a linear program by applying the Charnes-Cooper transformation \TODO{ref}, requiring weights to sum to 0, and rescaling the bounds by a nonnegative scale factor $t$. We obtain the following equivalent linear program, where we let $v_i = \frac{\tilde{r}_i}{\sm i n \tilde{r}_i + n}$:
%
%\begin{equation}\label{eq:linear_problem}
%	\bar{V}(\pi; \cl{S}) = max_{v, t \geq 0}\{\sm i n v_i \psi_i:\, \sm i n v_i = 0;\, t\tilde{l}(\Gamma) \leq v_i \leq \tilde{u}(\Gamma)t,\, \forall i=1, \ldots, n\}
%\end{equation}
%
%
%The dual problem to \Cref{eq:linear_problem} has dual variables $\lambda\in\mathbb{R}$ for the weight normalization constraint and $u, s \in \mathbb{R}_{+}^n$ for the lower and upper bound constraints on the weights, respectively, and is given by
\begin{theorem}(Closed-form solution of worst-case policy value.)\label{th:closed_form_lower_bound}
	Let $(i)$ denote the ordering such that $\psi_{(1)}(\theta) \leq \ldots \leq \psi_{(n)}(\theta)$. Then, an optimal solution of the worst-case policy value subproblem \labelcref{eq:worst_case_plicy_value} is given by%the corresponding optimal Radon-Nikod\'{y}m derivative at the solution
	\begin{equation}
		\overline{V}_{\text{Target}}(\pi; \cl{R}) = \frac{l\sm i {k^\ast} \psi_{(i)}(\theta) + u\sum_{i=k^\ast+1}^n \psi_{(i)}(\theta)}{lk^\ast + u(n-k^\ast)},
	\end{equation}
	with 
	\begin{align}
		&k^\ast = \inf\Big\{k\in\{0,\dots,n\}:\\
		&\frac{l\sm i {k} \psi_{(i)}(\theta) + u\sum_{i=k+1}^n \psi_{(i)}(\theta)}{lk + u(n-k)} \leq \psi_{(k+1)}(\theta)\Big\}.
	\end{align}
	The Radon-Nikod\'{y}m derivative at the optimal solution is given by $R_{(i)} ={l\mathds{1}\{i \leq k^\ast\} + u\mathds{1}\{i > k^\ast\}}$.
\end{theorem}
%The optimization problem in \labelcref{eq:worst_case_plicy_value} is a linear fractional program (LFP), since the constraints arising from optimizing over the uncertainty set $\cl{R}$ are linear. Hence, we can transform the LFP into a linear program (LP) using the Charnes-Cooper transformation \citep{charnes1962programming}, and solve it by dualizing the LP. 
See Appendix~A.3 for a proof. \Cref{th:closed_form_lower_bound} is appealing for two reasons: (i)~We prove that $\overline{V}_{\text{Target}}(\pi; \cl{R})$ is efficiently solved by a linear search over the sorted data. (ii)~We prove that the worst-case policy value is given by a maximum over a finite set, which we use in the following section to show that the minimax problem can be written as a difference-of-convex functions~(DC) problem. Based on this, we develop a convex-concave procedure to efficiently solve the minimax problem in (\ref{eq:MMP}).
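
\textbf{Worked example:} To illustrate the linear search, take the hypothetical values $n=3$, $(\psi_{(1)},\psi_{(2)},\psi_{(3)}) = (0,1,2)$, $l = 0.75$, and $u = 1.5$, and write $v_k$ (shorthand used only here) for the ratio inside the definition of $k^\ast$. Then
\begin{align*}
	v_0 &= \frac{1.5\,(0+1+2)}{3\cdot 1.5} = 1 > \psi_{(1)} = 0,\\
	v_1 &= \frac{0.75\cdot 0 + 1.5\,(1+2)}{0.75 + 2\cdot 1.5} = 1.2 > \psi_{(2)} = 1,\\
	v_2 &= \frac{0.75\,(0+1) + 1.5\cdot 2}{2\cdot 0.75 + 1.5} = 1.25 \leq \psi_{(3)} = 2,
\end{align*}
so $k^\ast = 2$, the worst-case policy value is $1.25$, and the maximizing weights are $(R_{(1)},R_{(2)},R_{(3)}) = (0.75, 0.75, 1.5)$. For comparison, the unweighted mean is $1$, so the maximization shifts weight toward the observations with the largest $\psi_{(i)}$.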

\subsection{Minimax Convex-Concave Procedure}\label{sec:MMCCP}
In this section, we develop the \textbf{m}ini\textbf{m}ax \textbf{c}onvex-\textbf{c}oncave \textbf{p}rocedure~(\MMCCP) to efficiently solve the minimax problem (\ref{eq:MMP}). To this end, we derive a DC-representation of the worst-case policy value based on its closed-form solution in \Cref{th:closed_form_lower_bound}. We make the following assumption.
\begin{assumption}\label{ass:DC_assumption}
	The set $\Theta$ is nonempty, compact, and convex. Furthermore, $\pi$ is a DC-function in $\theta$, \ie, $\pi(X, \theta)=\tilde g(X, \theta)- \tilde h(X, \theta)$, where $\tilde g$ and $\tilde h$ are convex in $\theta$, and differentiable.
\end{assumption}
Note that \Cref{ass:DC_assumption} is very general as the class of DC-functions is very rich. For instance, it includes all twice continuously differentiable functions \citep{horst1999dc}. We later show that \Cref{ass:DC_assumption} is fulfilled for the established policy class of logistic policies. %Moreover, every continuous function can be approximated arbitrarily well by a DC-function, since the set of DC-functions defined on a compact convex set $\Omega\subseteq\mathbb{R}^n$ is dense in $\mathcal{C}(\Omega)$ \citep{horst1999dc}.
First, we show that $\psi_i(\theta)$ can be written as a DC-function.
\begin{lemma}(DC-representation of $\psi_i(\theta)$.)\label{lm:dc_representation_of_Psi}
	%hold and $\psi_i(\theta)$ as in \labelcref{psi_dm}, \labelcref{psi_ipw}, or \labelcref{psi_dr}
	Under \Cref{ass:DC_assumption}, $\psi_i(\theta)$ is a DC-function in $\theta$, \ie, 
	\begin{equation}
		\psi_i(\theta)=g_i(\theta)-h_i(\theta),
	\end{equation}
	where $g_i$ and $h_i$ are convex in $\theta$.
\end{lemma}
See Appendix~A.4 for a proof. Now, using \Cref{lm:dc_representation_of_Psi} and \Cref{th:closed_form_lower_bound}, we prove that the worst-case policy value can be written as a DC-function.
\begin{theorem}(DC-representation of worst-case policy value.)\label{th:dc_representation_of_WCP}
	Under \Cref{ass:DC_assumption}, the worst-case policy value $\overline{V}_{\text{Target}}(\pi; \cl{R})$ is a DC-function in $\theta$, \ie,
	\begin{equation}
		\overline{V}_{\text{Target}}(\pi; \cl{R})=g(\theta)-h(\theta),
	\end{equation}
	where $g(\theta)$ and $h(\theta)$ are convex and given by
	\begin{align}
		&g(\theta)=\max_{R\in \cl{R}} \frac{\sm i n R_i\psi_i(\theta)}{\sm i n R_i}+\sm i n h_i(\theta)c_i,\\
		&h(\theta)=\sm i n h_i(\theta)c_i,
	\end{align}
	with $g_i$ and $h_i$ from \Cref{lm:dc_representation_of_Psi}, and non-negative constants $c_i$ for all $i$.
\end{theorem}
See Appendix~A.5 for a proof. Finally, with the DC-representation of the worst-case policy value in \Cref{th:dc_representation_of_WCP}, we can write the original minimax problem in (\ref{eq:MMP}) as a DC-program, \ie, 
\begin{equation}\label{eq:DC_representation_MMP}
	\min\limits_{\theta \in \Theta} g(\theta)-h(\theta),
\end{equation}
where $g(\theta)$ and $h(\theta)$ are convex and given in \Cref{th:dc_representation_of_WCP}. Hence, we can solve the minimax problem via a convex-concave procedure \citep{sriperumbudur2009convergence, yuille2003concave}. This yields our tailored \MMCCP for solving (\ref{eq:MMP}) as outlined in \Cref{alg:MMCCP}.
\begin{algorithm}[tb]
	\caption{\MMCCP}
	\label{alg:MMCCP}
	\begin{algorithmic}
		\STATE {\bfseries Input:} Initial parameter $\theta^0$, convergence tolerance $\delta_\mathrm{tol}$
		\STATE Set $k\gets0$
		\REPEAT
		\STATE Solve the \emph{convex} problem:
		\STATE $\theta^{k+1}\in\argmin_{\theta\in\Theta} \max_{R\in \cl{R}} \frac{\sm i n R_i\psi_i(\theta)}{\sm i n R_i}+\sum\limits_{i=1}^{n}c_i(h_i(\theta)-\langle \theta,\nabla h_i(\theta^k)\rangle)$
		\STATE Set $k\gets k+1$
		\UNTIL{$\lVert \theta^{k}-\theta^{k-1}\rVert<\delta_\mathrm{tol}$}
		%\newline\textbf{return} $\theta_k$
	\end{algorithmic}
\end{algorithm}
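The convex subproblem in \Cref{alg:MMCCP} arises from linearizing the concave part of \labelcref{eq:DC_representation_MMP}. A brief sketch of this standard step, assuming that $h$ is differentiable at $\theta^k$ (as the gradient in the update presupposes): since $h$ is convex, $h(\theta)\ge h(\theta^k)+\langle\nabla h(\theta^k),\theta-\theta^k\rangle$ for all $\theta\in\Theta$, and hence
\begin{equation*}
	g(\theta)-h(\theta) \le g(\theta) - h(\theta^k) - \langle \nabla h(\theta^k), \theta-\theta^k\rangle,
\end{equation*}
with equality at $\theta=\theta^k$. Dropping the terms that do not depend on $\theta$ and inserting $g$ and $h$ from \Cref{th:dc_representation_of_WCP} gives the objective that is minimized in each iteration.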
Next, we prove that the sequence $(\theta^k)_{k\in\mathbb{N}}$ generated by \MMCCP yields monotonically decreasing worst-case policy values and converges under mild assumptions.
\begin{theorem}(Theoretical Analysis of \MMCCP.)\label{th:convergence}
	Suppose the outcomes are bounded, \ie, $\vert Y \vert < C$, and \Cref{ass:DC_assumption} holds. Then, the following holds true:
	\begin{enumerate}
		\item The sequence $(\theta^k)_{k\in\mathbb{N}}$ generated by \MMCCP satisfies the monotonic descent property, \ie, for all $k\in\mathbb{N}$,
		\begin{align}
			\max\limits_{R\in \cl R} \frac{\sm i n R_i\psi_i(\theta^{k+1})}{\sm i n R_i}\le \max\limits_{R\in \cl R} \frac{\sm i n R_i\psi_i(\theta^k)}{\sm i n R_i}.
		\end{align}
		\item If $\tilde g$ and $\tilde h$ from \Cref{ass:DC_assumption} are strongly convex,\footnote{A function $f$ is strongly convex if $\rho(f)>0$, where $\rho(f)$ is the modulus of strong convexity of a convex function $f$, which is defined as $\rho(f)=\sup\{\rho\ge0:f(\cdot)-\frac{\rho}{2}\lVert\cdot\rVert_2^2 \text{ is convex}\}$.} then every limit point $\theta^\ast$ of $(\theta^k)_{k\in\mathbb{N}}$ is a stationary point\footnote{Note that the objective function is in general not differentiable, see Appendix~D. Hence, we consider stationary points in the context of convex analysis, \ie, \mbox{$0\in\partial g(\theta^\ast)\cap\partial h(\theta^\ast)$}, where $\partial$ denotes the subdifferential.} of (\textup{\ref{eq:MMP}}). Furthermore, $\lim\limits_{k\to\infty} \lVert \theta^{k+1}-\theta^k\rVert=0$.
	\end{enumerate}
\end{theorem}
See Appendix~A.6 for a proof. To summarize, we develop a tailored convex-concave procedure that efficiently solves (\ref{eq:MMP}). This is only possible since we proved that the worst-case policy value has a DC-representation (see \Cref{th:dc_representation_of_WCP}). In particular, our algorithm can be used on a rich class of policies and converges under mild assumptions. We now demonstrate that \Cref{ass:DC_assumption} holds for an established parameterized policy class which we use in our experiments.

\textbf{Logistic policies:} Logistic policies are defined by $\pi(X,\theta)=\sigma(\theta^\intercal X)$, where $\sigma(z)=1/(1+e^{-z})$. To find a DC-representation, it is sufficient to decompose $\sigma(z)$. Hence, we set $z=\theta^\intercal X$ and write
\begin{align}
	\tilde g_{\mathrm{log}}(z)&=\begin{cases}\frac{1}{4}z+\frac{1}{2}, & \text{if } z\ge 0,\\ \frac{1}{2}\tanh(\frac{1}{2}z)+\frac{1}{2}, & \text{else,}\end{cases}\\
	\tilde h_{\mathrm{log}}(z)&=\begin{cases}\frac{1}{4}z-\frac{1}{2}\tanh(\frac{1}{2}z), & \text{if } z\ge 0,\\ 0, & \text{else.}\end{cases}
\end{align}
It is straightforward to check that both functions are convex. They can be made strongly convex by adding $\frac{\lambda}{2}z^2$ to both functions. Since $\tilde g_{\mathrm{log}}$ and $\tilde h_{\mathrm{log}}$ are differentiable, \Cref{ass:DC_assumption} is fulfilled and, hence, \MMCCP converges for logistic policies. In Appendix~C, we show that \Cref{ass:DC_assumption} also holds for linear policies. In addition, logistic policies also satisfy the generalization bound in \Cref{thm:gen_bound}, since they have $\sqrt{n}$-vanishing Rademacher complexity. This can be seen by using that $\sigma$ is Lipschitz together with the Rademacher bound for linear classes \citep{maurer2006rademacher} and the scalar contraction inequality \citep{maurer2016vector}.
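
\textbf{Checking the decomposition:} The claims above can be verified directly; the following is a short sketch rather than a full proof. Using $\sigma(z) = \frac12 + \frac12\tanh(\frac12 z)$, we obtain $\tilde g_{\mathrm{log}}(z) - \tilde h_{\mathrm{log}}(z) = \sigma(z)$ in both cases. For the derivatives, with $\sigma'(z) = \sigma(z)(1-\sigma(z))$,
\begin{align*}
	\tilde g_{\mathrm{log}}'(z) &= \begin{cases}\frac14, & \text{if } z\ge 0,\\ \sigma'(z), & \text{else,}\end{cases}\\
	\tilde h_{\mathrm{log}}'(z) &= \begin{cases}\frac14 - \sigma'(z), & \text{if } z\ge 0,\\ 0, & \text{else.}\end{cases}
\end{align*}
Both derivatives are continuous at $z=0$ because $\sigma'(0)=\frac14$, and both are non-decreasing because $\sigma'$ is increasing on $(-\infty,0]$ and decreasing on $[0,\infty)$. This gives differentiability and convexity of both functions.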

\section{Experiments}\label{sec:experiments}
%Plots
\begin{figure*}[t]%{r}{0.5\textwidth}
	%\vspace{-40pt}
	%\centering
	\scalebox{0.525}{\hspace*{-2.5cm}\includegraphics{Simulation_results.pdf}}
	\caption{\footnotesize Policy regret improvement of our methods (blue) over the baseline methods (green) on the target population for different values of $\Gamma$. Compared to the baseline methods (\ie, DM, NIPW, and DR), our methods (\ie, GenDM, GenNIPW, and GenDR) show superior generalizability and, as such, improve the policy regret by up to 40\,\% at the true $\Gamma^\ast = 8$. Lower is better.}\label{fig:sim_res}%
\end{figure*}
In this section, we compare several policy learning methods to policies learned with our framework on the example of logistic policies. We demonstrate that our framework generalizes substantially better to the target population.
\subsection{Simulation Study}\label{sec::simulation_study}
We first consider a simulation study to demonstrate the effect of unrepresentative training data. For this, we consider the following data-generating process for the target population:
\begin{align}
	&\vc{X}\sim\mathcal{N}_5(\vc{\mu}, \vc{I}_5), \quad T\mid \vc{X} \sim \text{Bern}(\nicefrac12),\\
	&Y\mid (\mathbf{X}, T) = m(\mathbf{X}) + T\cdot C(\mathbf{X}) + \epsilon,
\end{align}
where $m(\mathbf{X})= \beta_0^\intercal\vc X + 3\xi$, $C(\mathbf{X}) = \nicefrac52 + \beta_1^\intercal\vc X-4\xi$, $\xi\sim \text{Bern}(\nicefrac12)$, and $\epsilon \sim \cl N (0,1)$. The covariate mean is $\mu = [\text{-}1, \nicefrac12, \text{-}1, 0, \text{-}1]$ and the outcome means are $\beta_0 = [0, \nicefrac34, \text{-}\nicefrac12, 0, \text{-}1]$ and $\beta_1 = [\text{-}\nicefrac32, 1, \text{-}\nicefrac32, 1, \nicefrac12]$, respectively. We sample from the target population using the following selection variable
\begin{equation}\label{eq:selection_var_sim}
	S \sim \text{Bern}\Big(\text{\scriptsize$\frac12$} + \text{\scriptsize$\frac{0.95}{2}$}\tanh(\text{-}10\,C(\vc X))\Big),
\end{equation}
which yields training data that is unrepresentative of the tail of the covariate distribution.\footnote{Note that we multiply the second term in \labelcref{eq:selection_var_sim} by $\frac{0.95}{2}$ to ensure that the selection probability remains strictly positive.} As baselines, we consider three established policy learning methods: the direct method (\textbf{DM}), the normalized inverse propensity weights method (\textbf{NIPW}), and the doubly robust method (\textbf{DR}). We compare these established methods against our generalizable methods with each of the three $\psi(\theta)$ in \labelcref{psi_dm}, \labelcref{psi_ipw}, and \labelcref{psi_dr}: the worst-case policy value obtained with the direct method (\textbf{GenDM}), with the normalized IPW method (\textbf{GenNIPW}), and with the doubly robust method (\textbf{GenDR}). We use kernel and logistic regression for estimating $\mu_t(x)$ and $W^{\mathrm{IPW}}$. The parameter $\mathbb{P}(S=1)$ is chosen by the data-driven calibration in \Cref{sec:calibration} and $\Gamma$ is varied across $\{1.0,1.2, 1.4, 1.6, 1.8, 2.0, 3.0, \ldots, 10.0\}$. Details on the implementation of \MMCCP are in Appendix~D.
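
\textbf{Range of the selection probability:} Since $\tanh$ takes values in $(-1,1)$, the Bernoulli parameter in \labelcref{eq:selection_var_sim} satisfies
\begin{equation*}
	0.025 = \tfrac12 - \tfrac{0.95}{2} < \tfrac12 + \tfrac{0.95}{2}\tanh(\text{-}10\,C(\vc X)) < \tfrac12 + \tfrac{0.95}{2} = 0.975.
\end{equation*}
Subjects with a large positive $C(\vc X)$ are therefore selected with probability close to $0.025$, whereas subjects with a large negative $C(\vc X)$ are selected with probability close to $0.975$.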

%Results/observations
We present the results for the different values of $\Gamma$ in \Cref{fig:sim_res}. Specifically, we show by how much our methods improve over the policy regret, $V_{\text{Target}}(\hat{\pi})-V_{\text{Target}}(\pi^\ast)$, of the corresponding baseline policy (\ie, DM, NIPW, and DR) when tested on the target population. Our methods achieve lower policy regrets on the target population across all methods and across all values of $\Gamma$. Specifically, relative to the policy regret of the baseline policy (green line), our methods (blue line) improve the policy regret on the target population by up to 40\,\%. By construction, for $\Gamma=1$ (left end of plots), our methods resemble the baseline methods and yield the same policy regret. When we increase $\Gamma$, our policies achieve substantial improvements of the policy regret on the target population over the baselines. The best policy regret on the target population is achieved for $\Gamma=8$, which is consistent with the simulation specifications, as the true $\Gamma^\ast=8$. For $\Gamma=8$, relative to the baseline policies, our methods improve the policy regret by up to 40\,\%. This demonstrates that policies learned with our framework generalize substantially better to the target population.

\subsection{Experiments on Clinical Trial Data}\label{sec::experiments}
We evaluate our methods using the AIDS Clinical Trial Group (ACTG) study 175 \citep{hammer1996trial}, which is particularly suited for evaluating our framework. This is because HIV-positive females tend to be underrepresented, which makes such studies unrepresentative of the target population (\ie, the HIV-positive population in the USA) \citep{gandhi2005eligibility, greenblatt2011priority}. In fact, in the ACTG 175 study, only 5.8\,\% of the patients are female, whereas HIV-positive females are more common in the USA population. The outcome $Y$ is the difference between the cluster of differentiation 4 (CD4) cell counts at the beginning of the study and the CD4 counts after $20\pm 5$ weeks. The average treatment effects on the male and female subgroups are $-8.97$ and $-1.39$, respectively, suggesting a large discrepancy in treatment effects between both subgroups. We consider two treatment arms: one treatment arm for both zidovudine (ZDV) and zalcitabine (ZAL) ($T=1$) vs. one treatment arm for ZDV only ($T=0$), comprising 1,056 patients in total. We consider 12 covariates (details on the covariates are in Appendix~B). Again, we compare our methods against the established baseline methods. This is a real-world clinical trial and, hence, we cannot access the true policy values on the target population. However, we investigate the behavior of our policies by studying the percentage of patients that are treated (\ie, $\pi(X)>0.5$) for varying $\Gamma$. For our GenDR, the result is presented in \Cref{fig_clinical_res}. The results for GenDM and GenNIPW are in Appendix~E. We find that, compared to the baseline policy, our policy treats fewer patients for increasing $\Gamma$. This seems reasonable, since females are underrepresented and have a lower average treatment effect. Specifically, the baseline policy tends to treat more patients, since there are more patients in the study that benefit from the treatment.
However, in the target population (with a greater proportion of females), fewer patients are expected to benefit (due to the lower treatment effect in the female subgroup). Our policy accounts for the underrepresentation of females and, as such, tends to treat fewer patients. This result indicates the potential of our framework for learning policies that generalize to the target population.
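
As a sketch of the quantity reported here, the percentage of treated patients can be written as follows. The notation is introduced only for illustration and is an assumption on our part: $\hat{\pi}_\Gamma$ denotes the policy learned for a given $\Gamma$, $x_1,\dots,x_n$ are the covariates of the $n$ patients over which the percentage is evaluated, and $\mathbf{1}\{\cdot\}$ is the indicator function:
\[
  \widehat{p}_{\mathrm{treat}}(\Gamma) \;=\; \frac{100\,\%}{n} \sum_{i=1}^{n} \mathbf{1}\bigl\{\hat{\pi}_\Gamma(x_i) > 0.5\bigr\}.
\]
The intuition behind the shift can likewise be illustrated with a simple mixture, assuming that the average treatment effect in a population with female share $q$ is the share-weighted average of the two subgroup effects reported above ($\tau_{\mathrm{female}} = -1.39$, $\tau_{\mathrm{male}} = -8.97$; the symbols $q$ and $\tau$ are again illustrative):
\[
  \tau(q) \;=\; q\,\tau_{\mathrm{female}} + (1-q)\,\tau_{\mathrm{male}},
  \qquad
  \tau(0.058) \;=\; 0.058\cdot(-1.39) + 0.942\cdot(-8.97) \;\approx\; -8.53 .
\]
Under this assumption, $\tau(q)$ moves from the male-subgroup value toward the female-subgroup value as $q$ increases, which is consistent with fewer patients being treated when females receive more weight.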

%We study the generalizability of policies learned by our framework on ACTG studies when the subjects in the study are not representative of the target population. Specifically, the ACTG 175 study with the original male/female proportion is considered as target population, while the training dataset is a subsample of the study, in which women with living with HIV are underrepresented.

%In particular, a unrepresentative subsample of the target population is obtained by using the selection variables $S$, with $\mathbb{P}(S=1\mid \vc X=\vc x) = 0.1 + 0.8\cdot(1-x_{\text{gender}})$, where $x_\text{gender} \in\{0,1\}$ and $x_\text{gender} = 1$ corresponds to a female subject.

%As a result, women are heavily underrepresented in the training dataset, leading to an unrepresentative training dataset.

%Findings


%Plot (just one method and rest in appendix)
%Plots
%\begin{wrapfigure}{r}{0.5\textwidth}
\begin{figure}[t]%{r}{0.5\textwidth}
	%\vspace{-10pt}
	%\vspace{-1.5em}
	\centering
	\scalebox{0.6}{\includegraphics{clinical_res_dr.pdf}}
	\caption{\footnotesize Percentage of patients with $\pi(X)>0.5$ for our GenDR policy method. Fewer patients are treated for increasing $\Gamma$.}\label{fig_clinical_res}
\end{figure}
%\end{wrapfigure}
\section{Conclusion}
We propose a novel framework for learning policies that generalize to the target population by optimizing the minimax policy value on the target population. %, which optimizes the minimax policy value to achieve the best worst-case policy value on the target population. %Over an uncertainty set around a selection variable, our framework optimizes the minimax policy value to achieve the best worst-case policy value on the target population. 
We prove that our framework yields policies whose value on the target population is no worse than the worst-case policy value. %As such, this has important implications for safe and reliable policy learning (\ie, safe policy learning or safe reinforcement learning \citep{laroche2019safe, ghavamzadeh2016safe, thomas2015high}), since we can guarantee that the upper bound on the policy value holds up to a certain probability. 
We solve the minimax problem via a tailored convex-concave procedure for which we prove convergence for parametrized spaces of policies. Experiments demonstrate the benefit of learning generalizable policies using our framework.

\subsection*{Acknowledgements}
We thank the anonymous reviewers for valuable feedback.
This work was supported by the Swiss National Science Foundation, grant number 186932.

%\clearpage
\bibliography{library}
%\bibliographystyle{plain}

\end{document}
