
\input{tab_list}

\section{Methodology}
\label{sec:method}

\mypara{Probabilistic Models}
\label{sec:pmodels}
To evaluate the transformations in \NAME, we obtain a set of \totalprogs{}
probabilistic programs from a popular repository~\cite{stanexamplemodels}
including 13 Regression models, 5 Time-Series/State-Space
models, and 6 Mixture models.
Table~\ref{tab:benchmarks} lists all probabilistic programs we
evaluate using \NAME, with their descriptions, numbers of parameters and data
items, and run times (in seconds) for the ADVI and NUTS inference algorithms
in Stan. The names of Time-Series models start with ``S-'', Mixture
models with ``M-'', \mbox{and Regression models with ``R-''.}






\mypara{Automated Inference}
We use NUTS~\cite{hoffman2014no} and
ADVI~\cite{KucukelbirETALStanVariationalInference} inference algorithms from
Stan~\cite{carpenter2016stan} -- a popular probabilistic programming language --
to run each transformed program and compare their relative behaviors. For NUTS,
we run each program with 4 chains, 1000 warmup iterations, and 1000 sampling
iterations with a timeout of 8 minutes for each chain. We exclude the programs
that timed out. For ADVI, we use 10000 iterations and 1000 posterior samples for
comparison. For our evaluation, we use Azure VMs, each with 4 cores, 2.3 GHz
CPU, and 16 GB RAM.




\mypara{Robustness Metric}
We use the \textit{RIMSE} metric defined in
Section~\ref{sec:robustness_metrics}. For each model, we repeat noise generation
and inference 5 times and compute the geometric mean of the \textit{RIMSE} scores.
We chose \textit{RIMSE} over the information criteria used in
\cite{gelman2013bayesian} for model selection because each of them is
difficult to apply to general probabilistic programs:
AIC~\cite{akaike1974new} does not work under strong priors;
DIC~\cite{ando2010bayesian} gives poor results when the distributions are not
well summarized by their mean; WAIC~\cite{watanabe2010asymptotic} and
Cross-Validation~\cite{stone1974cross} require data partitioning, which
is hard to automate for structured models; and the Bayes factor
method~\cite{kass1995bayes} only works well \mbox{for discrete models
\cite{mcelreath2020statistical}.}
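
For concreteness, the aggregation over the 5 repetitions described above can be
written as follows. The subscripted $\textit{RIMSE}_r$ (the score from the
$r$-th repetition) and the name $\textit{RIMSE}_\textit{agg}$ are notation
introduced here only for illustration:
\begin{align*}
\textit{RIMSE}_\textit{agg} = \left(\prod_{r=1}^{5} \textit{RIMSE}_r\right)^{1/5}
\end{align*}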


\mypara{Convergence Metric} Sampling-based automated inference may fail to
converge, which leads to inaccurate estimates. \Robts
that introduce new parameters can make it harder for the program to
converge and thus affect its accuracy and robustness.
Hence, for evaluation, we also measure the convergence score using the Gelman-Rubin
diagnostic~\cite{gelman2013bayesian}. A score significantly larger than 1
indicates non-convergence.
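
As a reference point, one standard form of this diagnostic, following
\cite{gelman2013bayesian}, is sketched below for a scalar parameter $\theta$
sampled with $m$ chains of $n$ draws each. The symbols $m$, $n$, $B$, $W$,
$\bar{\theta}_j$, $\bar{\theta}$, and $s_j^2$ are notation introduced here for
illustration, and the variant computed by a particular tool may differ (for
example, by first splitting each chain in half):
\begin{align*}
B &= \frac{n}{m-1}\sum_{j=1}^{m}\left(\bar{\theta}_{j}-\bar{\theta}\right)^2, \qquad
W = \frac{1}{m}\sum_{j=1}^{m} s_j^2, \\
\hat{R} &= \sqrt{\frac{\frac{n-1}{n}\,W + \frac{1}{n}\,B}{W}}
\end{align*}
where $\bar{\theta}_j$ and $s_j^2$ are the sample mean and sample variance of
chain $j$, and $\bar{\theta}$ is the mean of the chain means. When all chains
sample from the same distribution, the between-chain term $B/n$ is small
relative to $W$ and $\hat{R}$ is close to 1.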





\mypara{Noise Models}
\label{sec:attacks}
The dataset $\mathcal{D}$ is usually composed of response data
(labels) and explanatory data (features) of the same length. In this
work, we add noise only to the response data. We select five noise
levels for the fraction of perturbed data inputs: $2\%$, $4\%$, $6\%$,
$8\%$, and $10\%$. Hereafter, we denote the response data as \mbox{$\boldsymbol{y}$ and its size as $D$:}

\noindent$\bullet$\mypara{\bf{} Adding Outliers} We randomly select a subset of data points and
add random noise to them. Let $\textit{sd}(\boldsymbol{y})$ be the standard
deviation estimated from the original dataset $\boldsymbol{y}=\{y_1,y_2,\ldots
,y_D\}$; we then simulate the outliers by:
\begin{align*}  
 z_{i=1\ldots D} &\sim \textit{Bernoulli}(k\%) \\ 
y_i^\textit{Outliers} | z_i = 1 & \sim \mathcal{N}(c\cdot y_i , |y_i|\cdot \textit{sd}(\boldsymbol{y}))
\end{align*}
where $k\%$ is the amount of noise, which can be specified by the user. %
The constant $c > 1$ allows us to generate outliers far from the typical
observations. In our experiments, we let $c = k$. This noise model simulates
a scenario where some observations are corrupted due to some \mbox{exception or
failure (e.g., of a sensor, storage, or network)}.
\noindent$\bullet$\mypara{\bf{} Introducing Hidden Groups} This strategy introduces a hidden group
(with its own mode) that does not agree with the modeling assumptions. The
location and scale of the hidden group are controlled by the noise level $k$.
As in the previous case, we allow the user to specify the size of the data subset
to be changed (e.g., our experiments use $c=20\%$). %
{%
\begin{align*}  
z_{i=1\ldots D} &\sim \textit{Bernoulli}(c) \\ %
y_i^\textit{Hidden\_Group} | z_i = 1 &\sim \mathcal{N}(y_i + \frac{k}{2} \cdot \textit{sd}(\boldsymbol{y}), 0.1k)
\end{align*} 
}
\noindent$\bullet$\mypara{\bf{} Skewing Data} Using this strategy, we skew the distribution of data
points. Skewing causes the mean of the distribution to shift and lose symmetry.
It also makes it harder for the inference strategy to sample using a non-robust
model. In this work, we apply positive skew to the datasets -- for data
$\boldsymbol{y}$ we generate skewed data \mbox{as follows:}

  \vspace{-.2in}
{%
\begin{align*}
y_i^\textit{Skewed} = &\left(\frac{y_{i}-y_\textit{min}}{y_\textit{max}-y_\textit{min}}\right)^{(1+0.1k)}\!\!\cdot\!(y_\textit{max}-y_\textit{min})+y_\textit{min}
\end{align*}
}
\vspace{-.2in}

\noindent{}where $k$ is the noise level for this noise model. We first scale all the data
to $[0, 1]$, then raise it to a chosen power to skew it, and finally
scale it back to $[y_\textit{min}, y_\textit{max}]$, where $y_\textit{min}=\min_{i=1\ldots D} y_i$ and $y_\textit{max}=\max_{i=1\ldots
D} y_i$.
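
To make the three noise models above concrete, consider a small worked example.
All numbers are hypothetical and chosen only for illustration: a selected data
point $y_i = 2$ (i.e., $z_i = 1$), $\textit{sd}(\boldsymbol{y}) = 1.5$, and
noise level $k = 4$, assumed here to enter the formulas as the number $4$ (so
that $c = k = 4$ in the outlier model). Substituting into the first two models
gives:
\begin{align*}
y_i^{\textit{Outliers}} &\sim \mathcal{N}(4\cdot 2,\ |2|\cdot 1.5) = \mathcal{N}(8,\ 3) \\
y_i^{\textit{Hidden\_Group}} &\sim \mathcal{N}\!\left(2 + \tfrac{4}{2}\cdot 1.5,\ 0.1\cdot 4\right) = \mathcal{N}(5,\ 0.4)
\end{align*}
For skewing, assume $y_\textit{min}=0$, $y_\textit{max}=10$, $y_i=5$, and
$k=10$, so that the exponent is $1+0.1\cdot 10 = 2$:
\begin{align*}
y_i^{\textit{Skewed}} = \left(\frac{5-0}{10-0}\right)^{2}\cdot(10-0)+0 = 2.5
\end{align*}
The endpoints $y_\textit{min}$ and $y_\textit{max}$ map to themselves, while
interior points move toward $y_\textit{min}$, which produces the positive skew.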

A reproducible version of \NAME  is available at
\url{https://figshare.com/s/38668524113696505ef4}.
