\vspace{-.1in}
\section{Evaluation}
\label{sec:eval}


\vspace{-.05in}
\subsection{Performance of Transformations} %
\label{sec:rq1}
\vspace{-.05in}

We apply the following transformations (discussed in
Section~\ref{sec:transform}): Robust reweighting of data (\emph{Reweight}),
Localization of the location parameter (\emph{Local-Loc}), Localization of the scale
parameter \mbox{(\emph{Local-Scale})}, Reparameterization and Localization of the scale
parameter (\emph{Reparam}), Normal to StudentT (\emph{StudentT}), and
\mixture (\emph{Mixture}).
To evaluate the transformations using \NAME, we apply three
noise models (Outliers, Hidden Group, and Skewed Data) to the datasets
obtained from \totalprogs programs.


\mypara{General Trends of Different Transformations}
Figures~\ref{fig:avg_imp_at_noise_level_advi} and
\ref{fig:avg_imp_at_noise_level_nuts} present the geometric mean of the
relative improvement of
MSE (RIMSE) achieved by the robustness transformations at different noise levels
for the ADVI and NUTS algorithms, respectively. Each sub-plot presents the results
for one noise model. The X-axis represents the noise level, while the Y-axis
represents the \emph{geometric mean} of RIMSE over all programs. Each line in
the plots represents the performance of one transformation. We also present the
MSE of the original (non-robust) program at all noise
levels below the X-axis (as ``Orig MSE''). The
robust transformations reduce MSE by the factor
represented on the Y-axis (e.g., up to 3.31x for the StudentT transformation
under the Outliers noise model with ADVI).
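
For concreteness, the aggregate plotted on the Y-axis can be sketched as
follows. This sketch assumes that RIMSE is the ratio of the MSE of the original
program to the MSE of the transformed program, which is how the reduction
factors above read; the symbols $P$ (the set of programs),
$\mathrm{MSE}_p^{\mathrm{orig}}$, and $\mathrm{MSE}_p^{T}$ are notation
introduced here only for illustration:
\[
  \mathrm{RIMSE}_p(T) = \frac{\mathrm{MSE}_p^{\mathrm{orig}}}{\mathrm{MSE}_p^{T}},
  \qquad
  \mathrm{GeoMean}(T) = \Bigl(\prod_{p \in P} \mathrm{RIMSE}_p(T)\Bigr)^{1/|P|} .
\]
Under this reading, a value of 1 means that transformation $T$ leaves the MSE
unchanged, and values above 1 mean that it lowers the MSE relative to the
original program.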
\begin{figure*}[!htb]
  \begin{minipage}{0.6\linewidth}
\begin{minipage}{\textwidth}
  \centering
   \includegraphics[trim={0 0 0.1 .2in},clip,width=1.0\linewidth]{plots/avg_imp_line_conv_true.png}
\caption{Mean Improvement of Transformed Programs at Different Noise Levels (ADVI)}
\label{fig:avg_imp_at_noise_level_advi}
\end{minipage}
\begin{minipage}{\textwidth}
  \centering
   \includegraphics[trim={0 0 0.1 .2in},clip,width=1.0\linewidth]{plots/avg_imp_line_conv_false.png}
   \caption{Mean Improvement of Transformed Programs at Different Noise Levels (NUTS)}
\label{fig:avg_imp_at_noise_level_nuts}
\end{minipage}
\end{minipage}
\hspace{0.5cm}
\begin{minipage}[h]{0.35\linewidth}
    \scriptsize\centering  
    \captionof{table}{MSE Improvement at Noise Level~10 (Outliers)}
    \label{tab:mse_imp_table_outliers}
    \setlength{\tabcolsep}{2pt}
    \begin{tabular}{r|rl|rl}
    \toprule
    Prog & \multicolumn{2}{|c|}{ADVI} & \multicolumn{2}{|c}{NUTS} \\
    \midrule
  RE &          256.42 &       (StudentT) &          412.60 &       (StudentT) \\
  RV &           28.04 &       (StudentT) &           31.94 &        (Reparam) \\
  MC &           27.48 &         (Local1) &            1.00 &       (Original) \\
  SE &           14.23 &       (StudentT) &           16.02 &       (Reweight) \\
  RK &            8.41 &       (StudentT) &            9.25 &        (Reparam) \\
  RN &            7.11 &        (Reparam) &            6.25 &         (Local2) \\
  RU &            3.42 &       (StudentT) &            3.75 &       (StudentT) \\
  RA &            3.31 &       (StudentT) &            3.19 &        (Reparam) \\
  MF &            3.27 &       (StudentT) &            2.81 &        (Reparam) \\
  RQ &            3.23 &       (StudentT) &            3.78 &       (StudentT) \\
  RR &            2.95 &        (Reparam) &            3.00 &        (Reparam) \\
  RX &            2.93 &       (StudentT) &            3.18 &        (Reparam) \\
  SD &            2.52 &       (StudentT) &            3.52 &       (StudentT) \\
  MD &            2.21 &       (Reweight) &            6.08 &       (Reweight) \\
  ME &            1.27 &       (StudentT) &            1.41 &        (Reparam) \\
  RY &            1.25 &       (StudentT) &            1.00 &       (Original) \\
  MB &            1.14 &       (StudentT) &            1.22 &       (StudentT) \\
  RG &            1.04 &       (StudentT) &            1.03 &       (Reweight) \\
  SA &            1.02 &        (Mixture) &            1.56 &        (Reparam) \\
  RW &            1.00 &       (Reweight) &            1.00 &       (Reweight) \\
  SB &            1.00 &       (StudentT) &            1.00 &       (Original) \\
  SC &            1.00 &       (Original) &            1.05 &         (Local1) \\
  RL &            1.00 &       (Original) &            1.00 &       (Original) \\
  MA &            1.00 &       (Original) &            1.68 &       (StudentT) \\
    \midrule
    \multicolumn{5}{c}{(R-): Regression, (M-): Mixture, (S-): TimeSeries}\\
    \bottomrule
    \end{tabular}
\end{minipage}
\end{figure*}


Overall, the transformations are most effective for the \emph{Outliers} noise
model. The improvements are significantly smaller for the \emph{Hidden Group}
noise model. For the \emph{Skewed Data} noise model, none of the
transformations is effective
because the noisy samples are harder to distinguish from typical
observations. In general, RIMSE increases with the noise level, showing
that the transformations are more helpful when there is more corruption in the
data.

\insight{Our results show that most transformations do not generalize well
beyond the Outliers noise model and provide limited benefits. Hence, there is a
need to develop novel \robts, especially for Hidden Group and Skewed Data noise
models.}

For the Outliers and Hidden Group noise models, the \emph{StudentT} transformation
is the best in most cases, closely followed by \emph{Reparam}.
However, Reparam requires inferring many more parameters
than StudentT (e.g., for $D$ data points, the StudentT transformation adds one
parameter, while Reparam adds $D+1$ auxiliary parameters), which increases the
run time of inference (see also Section~\ref{sec:rq3}).


The Local-Loc and Local-Scale transformations provide less protection from noisy
data. Local-Scale may help improve the accuracy with NUTS, but it is likely to
diverge when using ADVI (Table~\ref{tab:rhat}), leading to inaccurate results.
One potential cause is
that we infer the hyper-parameters of the localization transformations in the
Bayesian model using automated inference, along with the other parameters.
It may be possible to obtain a
better result by applying the E-M algorithm proposed
in~\cite{wang2018general}, which is customized for each model.
However, it is unclear how to automatically apply such an algorithm \mbox{for general
probabilistic programs.}

  












\vspace{-.1in}
\subsection{Predictive Accuracy Improvement}
\vspace{-.07in}

\label{sec:rq2}
Table~\ref{tab:mse_imp_table_outliers} presents the RIMSE scores for all
programs with the Outliers noise model at noise level 10 for ADVI and NUTS. Each
row represents one program. Each column presents the largest improvement of MSE
and, in parentheses, the name of the transformation that enabled this improvement.
For example, ``256.42 (StudentT)'' means that StudentT is the best among all
the transformations (and the original program) and yields a 256.42x reduction of
the MSE of the original program. A larger value means that the
posterior obtained by the transformation is closer to the posterior based on the
non-noisy data. ``Local1'' stands for Local-Loc and ``Local2'' stands for
Local-Scale. For the RIMSE scores with
other noise models, see Appendix~D. We do not apply the Hidden Group noise model
to Mixture models or the
Skewed Data noise model to programs with binary data, since these combinations are unsuitable.
In summary, when using ADVI, StudentT provides the best improvement on 15
benchmarks, followed by Original, which is the best on 3 benchmarks, while each of the
other transformations leads on fewer than 3 benchmarks. When using
NUTS, Reparam is the best on 8 benchmarks; StudentT is the best on 6 benchmarks;
Reweight and Original each lead on 4 benchmarks; \mbox{Local1 and Local2 each
lead on one.}


\mypara{Characteristics for Different Model Categories}
Generally, Regression (R-) models show the largest improvement in robustness, while
Time-Series (S-) models show the smallest improvement.
We observe substantial improvements in most linear regression models (e.g., RE
and RV), since most transformations are designed for such models. However, for a
logistic regression model (RW), we observe no improvement (the best RIMSE is
1.00x). This model already has a higher tolerance for noise
than the other models because, for binary data, the noise is limited to the
range between 0 and 1, which makes most transformations redundant.

Most Time-Series models capture the auto-correlation within data points or fit a
correlation matrix for Gaussian processes.
As a result, small noise in the data may not
affect the fitted correlation. For instance, we observe that the MSE scores of the
original models of SA and SB are not affected as the noise level increases.
Further, since the \robts are generally designed for exchangeable data
\citep{Wang:2017}, they are unlikely to work well for many Time-Series models.
For instance, for models SC and SD, the transformations are not as effective as
they are for other models. Unlike the other Time-Series models, the model SE does not model the correlation
but describes a regression equation between the past and current observations,
and thus can benefit more from the robustness transformations.

Mixture models are less robust to outliers than other classes, because they
require fitting a large number of parameters, i.e., the locations, scales, and
probabilities of multiple groups. Several
\robts could help fit the locations correctly; however, they tend to classify outliers into
one of the groups and infer a less accurate scale or probability, which is the
case for models MD, MB, and ME. Also, mixture models are more expensive to fit
(due to the label switching problem \cite{stanmanualmixturehardness}) and thus are
likely to diverge when they are not robust to outliers. We observe that at
noise level 10, for the original mixture models, the geometric mean of the convergence
score is 8.94, which is much larger than that of all the original models (2.09).
As a result, models like MC and MD can occasionally show a large improvement (up
to 27.48x and 6.08x, respectively) when the original model diverges due to outliers but the
transformed model converges to the correct result.





\insight{Overall, we observe that the transformations are most useful for regression
models. However, for most time-series models and some classes of regression models, the benefits
of the transformations are limited, since these models are already tolerant to input noise.
For mixture models, due to the model complexity,
the transformations generally offer less protection against noise;
however, they might occasionally protect the original model from divergence.
}


\begin{table}[!t]
  \caption{(Geometric-)Mean of Convergence Score (Gelman-Rubin Diagnostic) at Noise Level 10}
  \label{tab:rhat}
\centering
  \scriptsize
  \setlength{\tabcolsep}{1pt}
 \begin{tabular}{l|rr|rr|rr}
  \toprule
    \textbf{Transformations} & \multicolumn{2}{c|}{\textbf{Outliers}} & \multicolumn{2}{c|}{\textbf{Hidden Group}} & \multicolumn{2}{c}{\textbf{Skewed Data}} \\ \midrule
  & ADVI & NUTS & ADVI & NUTS & ADVI & NUTS\\
\midrule
Original & 2.12 & 1.60 & 1.25 & 1.00& 2.15 & 1.19\\ 
Reweighting & 1.40 & 1.15 & 1.20 & 1.01& 1.36 & 1.06\\ 
Localized-Loc & 3.97 & 1.44 & 2.13 & 1.20& 5.07 & 1.31\\ 
Localized-Scale & 2.77 & 1.23 & 1.88 & 1.05& 5.48 & 1.19\\ 
Reparam-Local & 2.15 & 1.34 & 1.29 & 1.13& 2.27 & 1.25\\ 
StudentT & 1.69 & 1.36 & 1.15 & 1.05& 1.87 & 1.35\\ 
Cont. Group Mixture & 9.41 & -- & 9.53 & --& 9.94 & --\\ 
\bottomrule
\end{tabular}
\vspace{-0.1in}
\end{table}


\input{tables/evaltables.tex}

\mypara{Convergence of Transformed Models under Noise}
Table~\ref{tab:rhat} shows the geometric mean of the convergence scores of each
transformation, computed over all models, for the three noise models at noise level 10
when running with ADVI and NUTS. We present the convergence scores at noise
levels 2 and 6 in Appendix~E.
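
For reference, the standard Gelman-Rubin statistic compares the variance
between independent runs with the variance within each run. The following is a
sketch of the textbook form, assuming $M$ chains of $N$ draws each; the symbols
are notation introduced here, and the exact variant computed by the inference
tool (e.g., with split chains) may differ:
\[
  \hat{R} = \sqrt{\frac{\frac{N-1}{N}\, W + \frac{1}{N}\, B}{W}},
  \qquad
  B = \frac{N}{M-1} \sum_{m=1}^{M} \bigl(\bar{\theta}_m - \bar{\theta}\bigr)^2 ,
\]
where $W$ is the mean of the within-chain sample variances, $\bar{\theta}_m$ is
the mean of chain $m$, and $\bar{\theta}$ is the mean over all chains. Values
close to 1 indicate that the chains agree, and larger values indicate poorer
convergence.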




The convergence score with NUTS is generally better than that with ADVI: over
all the transformed programs, the geometric mean of the convergence score for
NUTS is 1.33, while for ADVI it is 2.74. We observe that StudentT generally has
a better average convergence score than Reparam with ADVI. For example, at noise
level 10 with the Outliers noise model, the average convergence score for Reparam is 2.15,
while for StudentT it is 1.69 (lower is better). When using NUTS, the
heavy-tailed nature of StudentT can make sampling less
efficient~\citep{stanmanualreparam}. Hence, with NUTS, the StudentT transformation
loses the convergence advantage over Reparam that it has with ADVI (e.g., 1.36
vs.\ 1.34 for Outliers in Table~\ref{tab:rhat}). This also explains why StudentT has
slightly higher RIMSE than Reparam when using ADVI, while their RIMSEs are
similar using NUTS.




The Local-Loc and Local-Scale transformations introduce a strong dependency
between the original parameter and the new parameters, as described in
\cite{gorinova2019automatic}. This creates a complex posterior geometry, which
is difficult for both algorithms to explore \citep{stanmanualreparam}. For
instance, for ADVI, at noise level 10, the geometric mean of the convergence
score over all models is 3.69 for Local-Loc and 3.18 for Local-Scale.
NUTS does not work well with mixture models (including the Mixture
transformation) \cite{stanmanualmixturehardness}. For ADVI, \emph{Cont. Mixture}
provides the best improvement for only two models.
Finally, good convergence does not necessarily lead to high accuracy. For
example, the Reweighting transformation obtains the best convergence score but
provides the best improvement for only four models.

\insight{The performance of a transformation depends on both the inference
algorithm and the convergence quality.}



\vspace{-.1in}
\subsection{The Overhead of Robustness}
\label{sec:rq3}

\mypara{Overhead of Transformations for Different Model Categories}
Table~\ref{tab:overhead_3cat} presents the time overhead of the
different transformations (relative to the original program) using ADVI and NUTS. We divide the benchmarks into
three categories: generalized linear models (GLM), Time-Series (TS), and Mixture
Models (Mix). The overhead is calculated by dividing the run time of a
transformed model by the run time of the original model, and then computing the
geometric mean over all the benchmarks in the corresponding category. For
instance, applying Reweight on GLM is 2.25x slower (on average) than
running the original program with ADVI. NUTS generally has a
higher overhead than ADVI. Also, the transformations incur
the largest overheads for Mixture Models, followed by GLM and
Time-Series.
The significant increase in execution time for Mixture Models arises because the
transformations introduce additional dependencies between the parameters in these
models, making inference more difficult and slower. On the other hand, since
Time-Series Models already have strong dependencies between the parameters, the
    robust transformations do not \mbox{affect their execution times much}.
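
The overhead computation described above can be written compactly as follows,
where $C$ is a model category, $t_p^{T}$ is the run time of program $p$ under
transformation $T$, and $t_p^{\mathrm{orig}}$ is the run time of the original
program (the symbols are notation introduced here for illustration):
\[
  \mathrm{Overhead}_C(T) =
  \Bigl(\prod_{p \in C} \frac{t_p^{T}}{t_p^{\mathrm{orig}}}\Bigr)^{1/|C|} .
\]
An overhead of 1 therefore means that the transformed programs in $C$ run, on
average, as fast as the originals.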











\mypara{Trade-off of Time vs Performance}
\label{sec:trade}
We evaluate how the choice of the best transformation (based on posterior
predictive accuracy) changes when the user has a limited time budget. For this experiment,
we consider different overhead time budgets (from 1x to $\infty$). For each
budget, we filter out the transformations that exceed the budget and choose
the best transformation among the rest. Figures~\ref{fig:ovheads_advi} and
\ref{fig:ovheads_nuts} present the results for ADVI and NUTS, respectively, for
the Outliers noise model. The X-axis represents the time budget.
The Y-axis represents the percentage of
models for which a transformation obtained the best improvement in predictive
accuracy. Each line shows the mean across all noise levels for either a
\mbox{transformation or the original model}.
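
The selection rule for a given budget $b$ can be sketched as follows. This
sketch assumes that the budget is checked per program against the ratio
$o_p(T)$ of the transformed run time to the original run time, and it writes
$\mathrm{Imp}_p(T)$ for the improvement in predictive accuracy of
transformation $T$ on program $p$; both symbols are notation introduced here
for illustration:
\[
  T^{*}_p(b) = \arg\max_{T \,:\, o_p(T) \le b} \mathrm{Imp}_p(T),
  \qquad b \in [1, \infty) .
\]
Since the original program has $o_p = 1$, it is always among the candidates, so
the candidate set is never empty, and at a budget of 1x only transformations
with no run-time overhead can compete with it.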

For lower time budgets (1--3x), the transformations often incur
unacceptable execution overheads, which makes the original model preferable
to the transformed models, especially for NUTS.
For ADVI, we observe that StudentT consistently dominates the other transformations
across all overhead budgets, while Reparam and Reweight take second place
in most cases and yield the best results for a similar number of cases.
For NUTS, StudentT and Reweight provide better gains than Reparam for overhead
budgets of up to 10x. However, for higher budgets, Reparam dominates Reweight
and performs closer to StudentT.

\insight{Overall, since we observe a larger variance of overheads for NUTS,
users should carefully select a robustness transformation based on the maximum
tolerable execution overhead in their applications.}













  



