\begin{table}[t]

\centering
\caption{Number of instances and features of nine datasets (after preprocessing).}
\vspace{0 pt}
\tabcolsep 5pt
   
 \begin{tabular}{ccc}
 \hline
 Name & \#instance & \#feature \bigstrut\\
 \hline
 airquality & 9357 & 12 \bigstrut[t]\\
 energy & 19735 & 24 \\
 hepmass & 150000 & 21 \\
 miniboone & 36488 & 43 \\
 onlinenews & 39644 & 32 \\
 parkinson & 5875 & 15 \\
 sdd & 58509 & 29 \\
 superconduct & 21263 & 68 \\
 mnist (d20) & 70000 & 20 \\
 \hline
 \end{tabular}
 \label{tab:dataset}
\end{table} 




\begin{table*}[t]
\renewcommand{\arraystretch}{1.0}
\centering
\caption{Loglikelihood scores, average test LL scores and number of wins (ties are ignored) of robust and standard MixMG models on various test sets and neighboring regions. The robust model learned through our algorithm achieved higher or similar average LL scores on most of the adversarial cases.}
\vspace{0 pt}
\tabcolsep 3pt
\begin{tabular}{ccccccc}
\hline
Dataset & Method & Original Test & Gaussian Test & Jitter Test & Worst NB & Average NB \\ \hline
\multirow{2}{*}{airquality} & DRSL & 9.29 ± 4.6 & -274.58 ± 201.1 & -322.60 ± 1378.4 & -290.95 ± 133.7 & -104.09 ± 74.6 \\
 & MLE & 9.53 ± 6.3 & -347.31 ± 205.3 & -879.70 ± 4533.5 & -380.05 ± 130.6 & -137.12 ± 74.4 \\
\multirow{2}{*}{energy} & DRSL & -7.55 ± 5.3 & -32.03 ± 17.0 & -63.25 ± 64.3 & -50.29 ± 20.2 & -21.92 ± 7.7 \\
 & MLE & -6.58 ± 6.6 & -39.75 ± 25.6 & -91.09 ± 98.3 & -67.26 ± 29.7 & -23.71 ± 9.4 \\
\multirow{2}{*}{hepmass} & DRSL & -25.15 ± 4.9 & -29.24 ± 3.9 & -26.08 ± 4.6 & -31.31 ± 2.9 & -27.06 ± 4.1 \\
 & MLE & -24.36 ± 4.7 & -28.52 ± 4.6 & -25.46 ± 4.9 & -30.89 ± 4.2 & -26.31 ± 4.3 \\
\multirow{2}{*}{miniboone} & DRSL & -24.28 ± 15.6 & -43.80 ± 14.0 & -53.85 ± 18.5 & -51.96 ± 12.3 & -32.67 ± 13.1 \\
 & MLE & -21.41 ± 14.8 & -44.58 ± 13.9 & -56.18 ± 20.9 & -53.59 ± 11.6 & -31.38 ± 12.8 \\
\multirow{2}{*}{mnist} & DRSL & -3.57 ± 6.0 & -6.01 ± 2.3 & -6.56 ± 2.3 & \textit{N/A} & \textit{N/A} \\
 & MLE & -0.59 ± 6.4 & -4.41 ± 4.2 & -5.87 ± 4.0 & \textit{N/A} & \textit{N/A} \\
\multirow{2}{*}{onlinenews} & DRSL & -1.61 ± 19.2 & -240.22 ± 229.9 & -1009.90 ± 1403.0 & -627.79 ± 84.8 & -115.92 ± 17.2 \\
 & MLE & -1.22 ± 27.5 & -257.00 ± 237.3 & -1013.92 ± 1404.9 & -632.71 ± 99.9 & -120.18 ± 53.1 \\
\multirow{2}{*}{parkinson} & DRSL & -5.47 ± 7.1 & -14.30 ± 5.9 & -15.88 ± 10.8 & -18.83 ± 4.7 & -10.07 ± 5.1 \\
 & MLE & -3.81 ± 11.1 & -16.13 ± 9.9 & -20.47 ± 33.9 & -21.20 ± 13.3 & -10.57 ± 9.9 \\
\multirow{2}{*}{sdd} & DRSL & 0.62 ± 42.7 & -95.17 ± 39.1 & -65.48 ± 37.9 & -95.49 ± 39.4 & -86.71 ± 39.5 \\
 & MLE & -3.83 ± 118.9 & -55.58 ± 116.6 & -50.20 ± 797.6 & -57.80 ± 120.0 & -53.03 ± 117.9 \\
\multirow{2}{*}{superconduct} & DRSL & 59.43 ± 49.6 & -235.21 ± 88.3 & -880.96 ± 968.7 & -249.90 ± 59.7 & -101.63 ± 34.3 \\
 & MLE & 62.82 ± 52.2 & -384.21 ± 140.7 & -1480.38 ± 1632.5 & -394.99 ± 94.1 & -164.14 ± 54.3 \\ \hline
\multirow{2}{*}{Average} & DRSL & 0.19 & \textbf{-107.84} & \textbf{-271.62} & \textbf{-177.06} & \textbf{-62.51} \\
 & MLE & \textbf{1.17} & -130.83 & -402.59 & -204.81 & -70.81 \\ \hline
\multirow{2}{*}{\#Wins} & DRSL & 1 & \textbf{6} & \textbf{6} & \textbf{5} & \textbf{4} \\
 & MLE & \textbf{8} & 3 & 3 & 1 & 2 \\ \hline
\end{tabular}
\label{tab:mixmg-ll}
\end{table*}



In this section, we empirically evaluate the method proposed in Section~\ref{sec:method} for learning distributionally robust probabilistic models in continuous domains. 


\subsection{Experiment Setup}
We consider nine real-world datasets in our experiments. One of them is the MNIST image dataset~\citep{lecun1998gradient}, while the other eight are selected from the UCI machine learning repository~\citep{Dua:2019}. 
Following \citet{uria2016neural}, we preprocess all UCI datasets by eliminating discrete-valued features and one attribute from every pair of attributes whose Pearson correlation coefficient is greater than $0.98$. 
For the MNIST dataset, we train a variational autoencoder~\citep{kingma2013auto} and embed each input image as a 20-dimensional feature vector in a structured hidden Gaussian space.\footnote{The encoder and decoder architectures are based on convolutional neural networks (CNNs).} 
All datasets are normalized by subtracting the mean and then dividing by the standard deviation. The number of instances and features for each dataset after preprocessing is shown in Table~\ref{tab:dataset}.
Note that for the eight UCI datasets, the data source does not define a train/test split, so we randomly choose 85\% of the data instances for the training split and use the remainder as the test split. We further set aside 20\% of the training instances for validation purposes.
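
The preprocessing steps above can be sketched as follows. This is a minimal illustration rather than the code used in our experiments: the function names are hypothetical, the correlation test is applied to the signed coefficient exactly as stated above, the normalization statistics are assumed to be computed over the whole dataset, and the split sizes are assumed to be rounded to the nearest integer.

```python
import numpy as np

def drop_correlated(X, threshold=0.98):
    """Drop one feature from every pair whose Pearson correlation exceeds threshold.

    Assumes no feature is constant (a constant feature has undefined correlation).
    """
    corr = np.corrcoef(X, rowvar=False)
    keep = []
    for j in range(X.shape[1]):
        # Keep feature j only if it is not too correlated with any kept feature.
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return X[:, keep], keep

def standardize(X):
    """Subtract the per-feature mean and divide by the per-feature standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def split(X, rng, train_frac=0.85, valid_frac=0.20):
    """Random 85/15 train/test split, then 20% of the training part for validation."""
    idx = rng.permutation(len(X))
    n_train_full = int(round(train_frac * len(X)))
    n_valid = int(round(valid_frac * n_train_full))
    valid = X[idx[:n_valid]]
    train = X[idx[n_valid:n_train_full]]
    test = X[idx[n_train_full:]]
    return train, valid, test
```

For a dataset stored as an array \texttt{X} of shape (instances, features), one would call \texttt{drop\_correlated}, then \texttt{standardize}, then \texttt{split} with a seeded \texttt{np.random.default\_rng}.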

We consider the following two types of probabilistic models in our experiment.
\begin{enumerate}
\item 
Mixture of multivariate Gaussians (MixMG), which serves as a standard benchmark model. The number of mixture components is treated as a hyperparameter and is tuned over the range from three to nine. 
Note that fitting a MixMG model on weighted data can be done efficiently using the EM algorithm, since the solutions of the E-step and M-step remain in closed form~\citep{legeleux2022gaussian}.  

\item
The NN-GBN model proposed by \citet{dong2022conditionally},\footnote{The authors do not name their model in the publication; for the purposes of this study, we refer to it as NN-GBN.} which models the full joint distribution as the product of a local, complex distribution over a small subset of variables and a fully tractable conditional distribution whose parameters are controlled by a neural network. We choose this model as a case study because (1) we are interested in continuous domains; and (2) NN-GBN can be easily adapted to parameter learning on weighted data by adding a weight term to the original loss function and returning the weighted negative loglikelihood as the loss. 
We tune the following two hyperparameters for NN-GBN: (1) the maximum learning rate from the set $\{10^{-2}, 3.3\times 10^{-3}, 10^{-3}\}$; and (2) the weight decay from the set $\{10^{-3}, 10^{-4}\}$. Additionally, we employ the OneCycleLR scheduler with cosine decay in PyTorch~\citep{paszke2019pytorch} to manage the learning rate. All other training configurations remain unchanged and match those used in the original code and paper.

\end{enumerate}
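
To illustrate why weighted data keeps the EM updates for MixMG in closed form, the sketch below performs one EM iteration in which every responsibility is scaled by its instance weight in the M-step. It is a minimal illustration under our own assumptions, not the implementation used in the experiments: the function names and the small diagonal regularizer \texttt{reg} are hypothetical.

```python
import numpy as np

def log_gaussian(X, mean, cov):
    """Log density of N(mean, cov) evaluated at each row of X."""
    d = X.shape[1]
    chol = np.linalg.cholesky(cov)
    sol = np.linalg.solve(chol, (X - mean).T)      # shape (d, n)
    maha = np.sum(sol ** 2, axis=0)                # squared Mahalanobis distances
    logdet = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def weighted_em_step(X, w, pis, means, covs, reg=1e-6):
    """One EM iteration for a Gaussian mixture on rows of X with instance weights w."""
    n, d = X.shape
    K = len(pis)
    # E-step: responsibilities do not depend on the instance weights.
    log_r = np.stack(
        [np.log(pis[k]) + log_gaussian(X, means[k], covs[k]) for k in range(K)],
        axis=1,
    )
    log_r -= log_r.max(axis=1, keepdims=True)      # numerical stability
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: weighted sufficient statistics, still in closed form.
    wr = w[:, None] * r                            # shape (n, K)
    Nk = wr.sum(axis=0)
    new_pis = Nk / w.sum()
    new_means = (wr.T @ X) / Nk[:, None]
    new_covs = []
    for k in range(K):
        diff = X - new_means[k]
        cov = (wr[:, k, None] * diff).T @ diff / Nk[k]
        new_covs.append(cov + reg * np.eye(d))     # hypothetical regularizer
    return new_pis, new_means, np.stack(new_covs)
```

With integer weights, this update coincides with running ordinary EM on a dataset in which each instance is repeated as many times as its weight.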



\begin{figure*}[t]
    \centering
    \includegraphics[width = 6 in]{figs/both.pdf}
    \caption{Average loglikelihood improvement achieved by DRSL robust learning for both (a) MixMG and (b) NN-GBN models in all five test scenarios.} 
    \label{fig:ll-gain}
\end{figure*}



For each type of probabilistic model, we first learn a robust model using the algorithm described in Section~\ref{sec:method}, and then compare its performance against the standard model learned using the MLE framework. 
We run 150 iterations of learning and adversarial steps, and select the hyperparameter configuration whose model achieves the highest loglikelihood on the validation set.
In addition, following \citet{hu2018does}, we set the maximum amount of distribution shift to $\delta = 0.5$. 

As discussed in Section~\ref{sec:practical}, running a full training procedure in every learning step is inefficient and unnecessary. Therefore, for the NN-GBN model, we perform only one training epoch of the neural network per learning step. To encourage convergence, we first pre-train the neural network for 50 epochs on unweighted training data. This prevents inaccurately estimated loglikelihoods from distorting the early adversarial iterations and producing poor weight assignments for the data, which could adversely impact the learning process and ultimately lead to divergence.

All experiments were conducted on a workstation equipped with a 16-core Intel Xeon Gold 6130 CPU and two Quadro P5000 GPUs. The datasets and code used in the experiments are publicly available on GitHub.\footnote{\href{https://github.com/LeonDong1993/UAI2024-RobustLearning}{UAI2024-RobustLearning}}



\begin{table*}[t]
\renewcommand{\arraystretch}{1.0}
\centering
\caption{Loglikelihood scores, average test LL scores and number of wins (ties are ignored) of robust and standard NN-GBN models on various test sets and neighboring regions. The robust model achieved higher average LL scores on all test cases including the original uncorrupted test set.}
\vspace{0 pt}
\tabcolsep 3pt
\begin{tabular}{ccccccc}
\hline
Dataset & Method & Original Test & Gaussian Test & Jitter Test & Worst NB & Average NB \\ \hline
\multirow{2}{*}{airquality} & DRSL & -1.16 ± 9.3 & -26.42 ± 68.9 & -218.76 ± 3270.2 & -14.17 ± 38.1 & -3.75 ± 11.2 \\
 & MLE & -1.23 ± 8.3 & -25.50 ± 64.1 & -311.49 ± 5217.5 & -13.42 ± 28.2 & -3.65 ± 9.1 \\
\multirow{2}{*}{energy} & DRSL & 0.99 ± 10.4 & -66.97 ± 63.1 & -142.42 ± 179.0 & -41.02 ± 40.0 & -5.94 ± 10.6 \\
 & MLE & 0.99 ± 10.8 & -67.96 ± 63.7 & -142.68 ± 175.2 & -41.96 ± 41.7 & -6.02 ± 10.9 \\
\multirow{2}{*}{hepmass} & DRSL & -26.81 ± 5.4 & -27.92 ± 6.3 & -26.52 ± 4.6 & -28.44 ± 6.2 & -26.91 ± 5.4 \\
 & MLE & -26.81 ± 5.4 & -27.92 ± 6.3 & -26.50 ± 4.6 & -28.44 ± 6.2 & -26.91 ± 5.4 \\
\multirow{2}{*}{miniboone} & DRSL & -26.30 ± 17.9 & -48.40 ± 20.0 & -78.54 ± 101.3 & -38.91 ± 21.2 & -28.43 ± 17.2 \\
 & MLE & -26.30 ± 17.9 & -48.40 ± 20.0 & -78.54 ± 101.3 & -38.91 ± 21.2 & -28.43 ± 17.2 \\
\multirow{2}{*}{mnist} & DRSL & -10.33 ± 6.2 & -13.22 ± 6.0 & -13.01 ± 5.5 & \textit{N/A} & \textit{N/A} \\
 & MLE & -10.31 ± 6.2 & -13.20 ± 5.8 & -13.02 ± 5.5 & \textit{N/A} & \textit{N/A} \\
\multirow{2}{*}{onlinenews} & DRSL & -19.43 ± 63.6 & -44.63 ± 129.2 & -39.19 ± 88.0 & -35.45 ± 114.5 & -23.11 ± 66.2 \\
 & MLE & -20.40 ± 107.3 & -59.34 ± 338.5 & -44.96 ± 161.9 & -50.66 ± 283.8 & -25.81 ± 115.1 \\
\multirow{2}{*}{parkinson} & DRSL & -5.05 ± 6.9 & -16.89 ± 9.7 & -25.87 ± 50.2 & -12.21 ± 8.3 & -6.20 ± 6.7 \\
 & MLE & -5.07 ± 7.4 & -17.69 ± 10.6 & -28.35 ± 53.0 & -13.02 ± 9.5 & -6.32 ± 7.2 \\
\multirow{2}{*}{sdd} & DRSL & -20.40 ± 490.1 & -56.76 ± 491.1 & -39.50 ± 514.8 & -52.93 ± 493.2 & -34.04 ± 489.9 \\
 & MLE & -36.21 ± 1922.4 & -66.12 ± 1907.2 & -46.96 ± 1381.2 & -64.81 ± 1934.7 & -48.56 ± 1921.7 \\
\multirow{2}{*}{superconduct} & DRSL & 40.44 ± 45.4 & -219.83 ± 105.0 & -756.24 ± 702.0 & -56.08 ± 59.6 & 7.11 ± 37.2 \\
 & MLE & 43.79 ± 45.6 & -208.76 ± 102.9 & -744.03 ± 702.2 & -48.62 ± 62.2 & 16.01 ± 41.4 \\ \hline
\multirow{2}{*}{Average} & DRSL & \textbf{-7.56} & \textbf{-57.89} & \textbf{-148.90} & \textbf{-34.90} & \textbf{-15.16} \\
 & MLE & -9.06 & -59.43 & -159.61 & -37.48 & -16.21 \\ \hline
\multirow{2}{*}{\#Wins} & DRSL & \textbf{4} & \textbf{4} & \textbf{5} & \textbf{3} & \textbf{3} \\
 & MLE & 2 & 3 & 2 & 1 & 1 \\ \hline
\end{tabular}
\label{tab:cnet-ll}
\end{table*}




\subsection{Adversarial Generative Performance}
We evaluated the generative performance of the robust model against its standard counterpart for both MixMG and NN-GBN models by comparing the average loglikelihood\footnote{Generally speaking, loglikelihood is not a good metric for evaluating a model's generative performance in continuous domains because it can be unbounded~\citep{dong2022conditionally}, unless the models being compared are in the same parametric family (which is the case for us).} they achieved on the original uncorrupted test set as well as on two additional adversarial test sets for each dataset. 
Specifically, we created two types of adversarial test sets for each dataset as follows.
\begin{enumerate}
    \item Gaussian: we add Gaussian noise with zero mean and unit standard deviation to the original test set. We clip all noise values to the interval $[-0.2, 0.2]$ to simulate minor to medium perturbations of the data.
    \item Jitter: we set some entries of the input to random numbers~\citep{su2019one}. In our experiment, we randomly pick 20\% of the input entries and assign each a value sampled uniformly from the interval $[-0.2, 0.2]$. This is considered a harder task because the change in value can be far greater than in the previous case. 
\end{enumerate}
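
The two corruptions above can be sketched as follows. This is a minimal illustration, not the experimental code: the function names are hypothetical, and we assume that the 20\% of entries are drawn over the whole test matrix rather than separately for each instance.

```python
import numpy as np

def gaussian_test_set(X, rng, clip=0.2):
    """Add N(0, 1) noise, clipped to [-clip, clip], to every entry of X."""
    noise = np.clip(rng.standard_normal(X.shape), -clip, clip)
    return X + noise

def jitter_test_set(X, rng, frac=0.2, low=-0.2, high=0.2):
    """Overwrite a random fraction of the entries of X with Uniform(low, high) draws."""
    X_adv = X.copy()
    n_pick = int(round(frac * X.size))
    # Assumption: entries are picked without replacement over the whole matrix.
    flat_idx = rng.choice(X.size, size=n_pick, replace=False)
    X_adv.flat[flat_idx] = rng.uniform(low, high, size=n_pick)
    return X_adv
```

Note that the Gaussian corruption moves each entry by at most 0.2, whereas the jitter corruption replaces an entry outright, so the change equals the distance between the original value and the new draw.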

We also investigated how the learned distribution behaves around the test points. We first randomly sampled 500 instances around each test point and evaluated the loglikelihoods of these neighboring data points. We then took the lowest loglikelihood among these neighbors (denoted as `Worst NB') and the average loglikelihood over the 500 points (denoted as `Average NB'). These two values constitute the neighbor metrics for the test point. We repeated this process for every test point and averaged each metric over all test points within each dataset.
Note that for the MNIST dataset, because the inputs are embedding vectors from the trained variational autoencoder, the neighboring vectors may not correspond to any real image inputs. For this reason, we exclude the results of MNIST neighboring data from the loglikelihood calculation and the summarization process.
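
The neighbor metrics can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, \texttt{log\_density} stands for any learned model that returns one loglikelihood per row, and the neighborhood is assumed to be a uniform box of half-width \texttt{radius} around the test point, a choice made here purely for illustration.

```python
import numpy as np

def neighbor_metrics(log_density, X_test, rng, n_neighbors=500, radius=0.2):
    """Return (Worst NB, Average NB) averaged over all test points."""
    worst, average = [], []
    for x in X_test:
        # Assumption: neighbors are drawn uniformly from a box centered at x.
        offsets = rng.uniform(-radius, radius, size=(n_neighbors, x.shape[0]))
        ll = log_density(x + offsets)
        worst.append(ll.min())      # lowest loglikelihood among the neighbors
        average.append(ll.mean())   # mean loglikelihood over the neighbors
    return float(np.mean(worst)), float(np.mean(average))
```

By construction, the Worst NB value never exceeds the Average NB value for the same model and test set.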


We report the test set loglikelihoods and corresponding standard deviations achieved for MixMG and NN-GBN in Tables~\ref{tab:mixmg-ll} and \ref{tab:cnet-ll}, respectively. Furthermore, we have summarized the performance improvements regarding the average loglikelihoods of the robust model compared to its standard counterparts in Figure~\ref{fig:ll-gain} for all five types of testing scenarios.
Note that for the NN-GBN model, both robust and standard models achieved identical results for the miniboone dataset. This consistency arises from our approach of continuously monitoring the models' performance on the validation set at each iteration and retaining the model in its optimal state. For the miniboone dataset, which is particularly susceptible to overfitting, both training methods delivered their best performance during the early pre-training iterations. 

From these results, we make the following observations. Firstly, the model trained with the DRSL framework consistently outperformed its counterpart in terms of the average loglikelihood score on both adversarial test sets, as shown in Figure~\ref{fig:ll-gain}.
We also noted a particularly intriguing result: in some cases, models trained using the DRSL framework achieved higher or similar average loglikelihood scores on the original, uncorrupted test set. 
Several factors may contribute to this phenomenon: (1) the DRSL framework shapes the distribution more effectively instead of merely spreading density over neighboring points, a strategy that tends to lower the loglikelihood on the original test set in practice; (2) model selection is based on performance on the original validation set, which encourages the selected model to fit the original data well in addition to the adversarial examples.

Secondly, we observed that DRSL exhibited more substantial improvements on the jittered adversarial data compared to the Gaussian adversarial test set. This suggests that distributionally robust learning offers higher tolerance for challenging corruptions such as measurement errors.

Lastly, the results on the two neighbor metrics indicate that the robust model learns a distribution that is smoother around real data points. In practical terms, this suggests that DRSL captures the underlying data distribution rather than fitting only isolated, exceptional cases, and models the broader data context more robustly. This characteristic contributes to its better performance across the testing scenarios considered in our experiments.
We additionally note that the robust model showed larger improvements on the worst neighbor metric than on the average neighbor metric. This observation agrees with our earlier finding that the improvement on the jitter test set is larger than on the Gaussian test set.



