%\subsection{Notation}

We use bold uppercase letters for a set of random variables, e.g., $\bm{X}$, while a single random variable is denoted using normal uppercase letters, e.g., $A$. 
In addition, members of a set of random variables are indexed by subscripts, e.g., $X_i$ denotes the $i^{th}$ random variable in the set $\bm{X}$. The size of a set of random variables $\bm{X}$ is denoted as $|\bm{X}|$.
The instantiations (configurations) of random variables are denoted as lowercase letters. For example, $\bm{x}$ is one possible configuration for all variables in $\bm{X}$ and $a$ is a possible value that the random variable $A$ can take. 
If $\bm{Y}$ is a subset of $\bm{X}$, then the projection of assignment $\bm{x}$ onto set $\bm{Y}$ is denoted as $\bm{x}_{ \bm{Y} }$.  All random variables considered in this paper are assumed to be real-valued unless otherwise noted. 


\subsection{Distributionally Robust Supervised Learning}
In this section, we first introduce the Empirical Risk Minimization (ERM) framework for supervised learning along with its connection to Maximum Likelihood Estimation (MLE). We then describe the Distributionally Robust Supervised Learning (DRSL) framework, which incorporates distributional robustness into the ERM framework, and show that DRSL can be employed as a surrogate for MLE to learn robust probabilistic models.

\subsubsection{Empirical Risk Minimization (ERM)}
In a typical supervised learning setting, the input training data is assumed to be free of corruption and noise, and our objective is to find the model parameters, $\theta$, that minimize the expected loss with respect to the unknown data distribution, i.e.,
$$
\argmin_{\theta} \mathbb{E}_{(\bm{x}, {y}) \sim P(\bm{X}, {Y})} [\mathcal{L}(\bm{x}, {y}, \theta)].
$$
Here, $P(\bm{X}, {Y})$ is the unknown data distribution over the input feature variables $\bm{X}$ and label variable $Y$; and $\mathcal{L}(\bm{x}, {y}, \theta)$ is the loss function.
Given a dataset $\mathcal{D} = \{ (\bm{x}_i,y_i) \mid i=1,\dots,n \}$ that is assumed to consist of i.i.d.\ samples drawn from the distribution $P(\bm{X}, {Y})$, the above expected-loss minimization problem can be approximated as 
$$
\theta^* = \argmin_{\theta} \frac{1}{n} \sum_{i=1}^n \mathcal{L}_i( \theta),
$$
where $\mathcal{L}_i( \theta)  \equiv \mathcal{L}(\bm{x}_i, {y_i}, \theta) $ is the loss with respect to the $i^{th}$ input data instance.

%~\footnote{In case of the classical MSE loss, we would have $\mathcal{L}(\bm{x}, {y}, \theta) = \Vert M_{\theta}(\bm{x})- y \Vert ^ 2$.}. 
\subsubsection{Maximum Likelihood Estimation (MLE) as ERM}
\label{sec:ll_and_erm}
Given data observations $\mathcal{D} = \{ \bm{z}_1, \dots, \bm{z}_n \}$ that are drawn independently from a distribution $P_{\theta}(\bm{Z})$ with unknown parameter $\theta$, the goal of MLE is to find the parameters that maximize the log-likelihood, i.e.,
$$
\argmax_{\theta} \log \mathcal{L}_{\theta}(\mathcal{D}) = \argmax_{\theta} \sum_{i=1}^n \log P_{\theta} (\bm{z}_i).
$$
This can be put into the ERM framework by setting the loss function as the negative log-likelihood, $\mathcal{L}_i(\theta) = -\log P_{\theta} (\bm{z_i})$, and observing that 
$$
\theta^* = \argmax_{\theta} \sum_{i=1}^n \log P_{\theta} (\bm{z_i}) = \argmin_{\theta} \frac{1}{n} \sum_{i=1}^n \mathcal{L}_i(\theta) ,
$$ 
which means that the ERM framework can be used to solve the MLE task. Note that negating the objective turns the maximization into a minimization, and that scaling it by $\nicefrac{1}{n}$ does not change the optimal parameters $\theta^*$.
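
As a minimal illustration of this equivalence (an assumed toy model, not one used in this paper), suppose that $P_{\theta}$ is a univariate Gaussian with unknown mean $\theta$ and known variance $\sigma^2$, and write $z_i$ for the scalar observations. The per-instance loss is then
$$
\mathcal{L}_i(\theta) = -\log P_{\theta}(z_i) = \frac{(z_i - \theta)^2}{2\sigma^2} + \frac{1}{2} \log (2 \pi \sigma^2),
$$
and setting the derivative of the empirical risk to zero gives
$$
\frac{\partial}{\partial \theta} \frac{1}{n} \sum_{i=1}^n \mathcal{L}_i(\theta) = -\frac{1}{n \sigma^2} \sum_{i=1}^n (z_i - \theta) = 0 \quad \Longrightarrow \quad \theta^* = \frac{1}{n} \sum_{i=1}^n z_i .
$$
Since the empirical risk is convex in $\theta$, this stationary point is the minimizer, so the ERM solution coincides with the familiar maximum likelihood estimate, the sample mean.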

\subsubsection{DRSL formulation}
Unlike ERM, DRSL~\citep{bauso2017distributionally,namkoong2016stochastic} is explicitly formulated for cases where the test distribution $Q$ is shifted from the training distribution $P$. Specifically, DRSL considers test distributions $Q$ from an uncertainty set $U_{P,\delta}$ that contains all distributions whose $f$-divergence from $P$ is at most $\delta$, i.e.,
$$
U_{P, \delta} = \{ Q| D_f(Q,P) \le \delta \},
$$
where $D_f(Q,P) = \mathbb{E}_{P} \left [ f \left (\nicefrac{Q}{P} \right)  \right]$ is the $f$-divergence between the distributions $Q$ and $P$, with a convex function $f(\cdot)$ that satisfies $f(1)=0$ (so that $P = Q$ implies zero divergence). Note that the support of $Q$ is assumed to be a subset of the support of $P$. In other words, $P(x) = 0$ implies $Q(x) = 0$. 
When $f(x) = x \log x$, the f-divergence reduces to the well-known Kullback-Leibler divergence~\citep{kullback1951information}. 
The $\delta$ in the above equation is a hyper-parameter that controls the amount of distributional shift. When $\delta = 0$, DRSL reverts to standard ERM learning.
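
To see why the choice $f(x) = x \log x$ recovers the KL divergence, it can be substituted into the definition of $D_f$. The following sketch assumes that $P$ and $Q$ admit densities, uses the convention $0 \log 0 = 0$, and introduces the symbol $\mathrm{KL}(\cdot \,\|\, \cdot)$ only for this illustration:
$$
D_f(Q,P) = \mathbb{E}_{P} \left[ \frac{Q}{P} \log \frac{Q}{P} \right] = \int P(x) \, \frac{Q(x)}{P(x)} \log \frac{Q(x)}{P(x)} \, dx = \mathbb{E}_{Q} \left[ \log \frac{Q}{P} \right] = \mathrm{KL}(Q \,\|\, P),
$$
where the integral is taken over the support of $P$. This $f$ is convex on the non-negative reals and satisfies $f(1) = 1 \cdot \log 1 = 0$, as required.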


The objective of DRSL is to find the parameter $\theta$ that minimizes the expected loss with respect to the worst-case test distribution $Q \in U_{P,\delta}$, which can be formulated as the minimax problem
$$
\argmin_{\theta} \sup_{Q \in U_{P,\delta} } \mathbb{E}_{(\bm{x}, {y}) \sim Q(\bm{X}, {Y})} [\mathcal{L}(\bm{x}, {y}, \theta)].
$$
Setting $r(\bm{x},y) = \frac{Q(\bm{x},y)}{P(\bm{x},y)}$, we can reformulate the objective as 
$$
\argmin_{\theta} \sup_{r \in \mathcal{U}_{P,\delta} } \mathbb{E}_{(\bm{x}, {y}) \sim P(\bm{X}, {Y})} [r(\bm{x},y) \mathcal{L}(\bm{x}, {y}, \theta)],
$$
where 
\begin{align*}
\mathcal{U}_{P, \delta} =  \{   r(\bm{x},y) | & \mathbb{E}_{ P(\bm{X}, {Y})} [f( r(\bm{x},y) ) ] \le \delta, \\
 &  \mathbb{E}_{ P(\bm{X}, {Y})} [ r(\bm{x},y) ] = 1, \\
 &  r(\bm{x},y) \ge 0 \}.
\end{align*}
The first constraint in the set $\mathcal{U}_{P, \delta}$ guarantees that the $f$-divergence between $Q$ and $P$ is less than or equal to $\delta$, while the second and third constraints guarantee that $Q$ is a valid distribution.
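The reformulation itself is a change of measure. As a brief sketch (written with integrals, which assumes that $P$ and $Q$ admit densities), the support assumption on $Q$ gives
$$
\mathbb{E}_{Q} [\mathcal{L}(\bm{x}, y, \theta)] = \int Q(\bm{x},y) \, \mathcal{L}(\bm{x}, y, \theta) \, d\bm{x} \, dy = \int P(\bm{x},y) \, r(\bm{x},y) \, \mathcal{L}(\bm{x}, y, \theta) \, d\bm{x} \, dy = \mathbb{E}_{P} [r(\bm{x},y) \mathcal{L}(\bm{x}, y, \theta)],
$$
and the same substitution with $\mathcal{L} \equiv 1$ yields the normalization constraint $\mathbb{E}_{P}[r(\bm{x},y)] = \int Q(\bm{x},y) \, d\bm{x} \, dy = 1$.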
Similar to the ERM case, the expectations in the above formulation can be approximated using samples as
\begin{equation}
\label{eq:arm}
\argmin_{\theta} \sup_{\bm{r} \in \hat{ \mathcal{U} }_{\delta}}  \frac{1}{n} \sum_{i=1}^n r_i \cdot \mathcal{L}_i(\theta),
\end{equation}
where
$$
\hat{ \mathcal{U} }_{\delta} = \left \{   \bm{r} \bigm| \frac{1}{n} \sum_i f(r_i) \le \delta, \frac{1}{n} \sum_i r_i = 1, r_i \ge 0  \right \},
$$
$r_i = r(\bm{x_i}, y_i)$, and $\bm{r} = (r_1,r_2,...,r_n)$ is the vector of density ratios.
This problem can be treated as a minimax game between  an adversary and a learner in which the adversary reweights the losses of all data instances using $\bm{r}$, and the learner then tries to minimize the weighted loss~\citep{hu2018does,bauso2017distributionally}. 
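
As a small worked example of this game (the numbers are hypothetical and chosen only for illustration), take $n = 2$ with losses $\mathcal{L}_1(\theta) = 1$ and $\mathcal{L}_2(\theta) = 3$ at some fixed $\theta$, and let $f(x) = x \log x$ with the convention $0 \log 0 = 0$. The constraints require $r_1 + r_2 = 2$ and $r_1, r_2 \ge 0$. When $\delta = 0$, Jensen's inequality together with the strict convexity of $f$ forces $r_1 = r_2 = 1$, and the objective equals the ERM average, $\frac{1}{2}(1 + 3) = 2$. When $\delta \ge \log 2$, the adversary can choose $\bm{r} = (0, 2)$, since
$$
\frac{1}{2} \left( f(0) + f(2) \right) = \frac{1}{2} \cdot 2 \log 2 = \log 2 \le \delta,
$$
and the objective then equals the largest individual loss, $\frac{1}{2} (0 \cdot 1 + 2 \cdot 3) = 3$. Intermediate values of $\delta$ interpolate between these two extremes by shifting weight toward the instance with the larger loss.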

As discussed in Section~\ref{sec:ll_and_erm}, by choosing the loss function as the negative log-likelihood, $\mathcal{L}_i(\theta) = -\log P_{\theta} (\bm{z_i})$, and plugging it into \eqref{eq:arm}, we can use the DRSL framework to learn a robust probabilistic model $P_{\theta}(\bm{Z})$.
In fact, as we will show later in Section~\ref{sec:method}, the inner maximization problem can be solved exactly in linearithmic time when the KL divergence is employed, assuming that the log-likelihoods $\log P_{\theta}(\bm{z})$ can be computed efficiently and that the probabilistic model admits efficient learning on weighted data. 

% \subsection{Continuous Tractable Probabilistic Models}
% \label{sec:nn-gbn}
% % Give an brief introduction of our previous work from the third-personal view.
% % need to explain why choose this work a little bit

% In this paper, we focus on the distributional robustness within the realm of \emph{continuous} tractable probabilistic models. 
% We follow the work of \citet{dong2022conditionally} and use the NN-GBN model~\footnote{The model introduced by the authors remains unnamed in their publication, and for the purposes of this study, we will refer to it as NN-GBN.} they proposed as a case study. 
% Compared to other continuous TPMs~\citep{molina2018mixed,madeira2022tractable} based on the SPN framework~\citep{poon2011sum}, NN-GBN can be easily adapted for learning on weighted data, while the weighted structure and parameter learning for SPN remains unclear~\footnote{To the best of our knowledge, no established structure learning algorithm for SPNs on weighted data exists.}. 

% NN-GBN models the full joint distribution over a set of random variables $\bm{Z}$ as the product of two parts: a local distribution $P(\bm{X})$ over a small subset of variables $\bm{X} \subset \bm{Z}$ and a fully tractable conditional distribution $P(\bm{Y}|\bm{X})$ over the remaining variables $\bm{Y} = \bm{Z} \backslash \bm{X}$, i.e., 
% $$
% P(\bm{Z}) = P(\bm{X}) \cdot P(\bm{Y} | \bm{X}).
% $$
% The local distribution $P(\bm{X})$ is modelled using a mixture of multivariate Gaussian (MixMG) distributions, while $P(\bm{Y} | \bm{X})$ is modelled using a Gaussian Bayesian network (GBN)~\citep{grzegorczyk2010introduction} whose parameters, $\theta$, are selected by a neural network (NN) that takes the assignment $\bm{X} =  \bm{x}$ as input, i.e.,
% $$
% P(\bm{z}) = P(\bm{x}, \bm{y})  = MixMG(\bm{x}) \cdot  GBN_{\theta}(\bm{y}),
% $$
% where $\theta = NN(\bm{x})$. The neural network models the non-linear dependence between $\bm{X}$ and $\bm{Y}$ and yields a potentially infinite Gaussian mixture model over $\bm{Z}$~\citep{dong2022conditionally}.
% NN-GBN allows \emph{tractable exact inference} (including the Marginal Max-a-Posterior (MMAP) task) when all variables in the local distribution $P(\bm{X})$ are observed. Otherwise, the authors show that cutset sampling can be employed to efficiently generate accurate predictions in practice.










