\vspace{-.04in}
\section{Introduction}
\vspace{-.04in}

Probabilistic programming (PP) has recently emerged as a general and flexible
approach for Bayesian
inference~\citep{goodman2008church,carpenter2016stan,bingham2018pyro}. PP
decouples model specification from inference procedures, and thus allows
users to update their models while automatically applying general inference
algorithms for Markov Chain Monte Carlo (MCMC)
sampling~\citep{robert2013monte} or Variational Inference
(VI)~\citep{beal2003variational}. In recent years, PP
has been applied in various real-world machine learning applications, e.g.,
forecasting~\citep{taylor2018forecasting}, recommendations in social networks
and predicting user locations~\citep{ai2019hackppl,gokkayabayesian}, rating
players in games~\citep{gordon2014probabilistic}, \mbox{and COVID-19
modeling~\citep{bherwani2021understanding}.}

Automatically deploying probabilistic programs on such a diverse set of
real-world applications raises the question of how much the results of their
inferences change in the presence of outliers and other deviations of the data
from the model's assumptions (which we collectively call \emph{noise}).
\emph{Robustness} is
the property of systems, including probabilistic programs, to remain unaffected
by data noise~\citep{huber1981robust}. For instance, many statistical models
assume a Gaussian prior or likelihood, but a few data points that are far away
from the rest can significantly change the inferred mean. In
contrast, inference using robust models is more likely to yield posteriors that
are not affected by such noise.
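As a minimal illustration (a sketch under simplifying assumptions that are ours:
a flat prior on the mean $\mu$ and a Gaussian likelihood with known variance),
the posterior mean of $\mu$ given observations $x_1,\dots,x_n$ equals the sample
mean. Shifting a single observation $x_j$ by $\Delta$ then shifts the posterior
mean by
\[
\frac{1}{n}\Bigl(\sum_{i \neq j} x_i + x_j + \Delta\Bigr)
  - \frac{1}{n}\sum_{i=1}^{n} x_i \;=\; \frac{\Delta}{n},
\]
which grows without bound in $\Delta$. For example, with $n = 20$ a single
outlier displaced by $\Delta = 100$ moves the inferred mean by $5$.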










\Robts have traditionally been custom-designed for specific models, such as
linear regression. Some common \trans can, however, be applied across different
model classes, for instance, by replacing a Gaussian likelihood with Student-T.
Nevertheless,
it remains unknown (1) which \robt to apply to obtain the most robust inference
result for a given model and noise model, and (2) how off-the-shelf inference
algorithms in popular PP languages interact with these \trans, e.g., which
ones have higher execution overhead. Most previous works consider only a
few examples, without thorough comparisons or systematic run time measurements.
These questions become particularly important when the models are deployed
in real-world settings where both
accuracy (inference results)
and execution cost matter.
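
One way to see why replacing a Gaussian likelihood with Student-T can help is
the following sketch. The notation is ours and purely illustrative: $r$ is the
residual of one observation, $\sigma$ a scale parameter, and $\nu$ the degrees
of freedom. Up to additive constants, the per-observation negative
log-likelihoods are
\[
\ell_{\mathrm{Gauss}}(r) = \frac{r^2}{2\sigma^2},
\qquad
\ell_{\mathrm{StudentT}}(r) = \frac{\nu+1}{2}\,
  \log\Bigl(1 + \frac{r^2}{\nu\sigma^2}\Bigr),
\]
with derivatives
\[
\ell'_{\mathrm{Gauss}}(r) = \frac{r}{\sigma^2},
\qquad
\ell'_{\mathrm{StudentT}}(r) = \frac{(\nu+1)\,r}{\nu\sigma^2 + r^2}.
\]
The Gaussian term pulls the fit toward a distant point with a force that grows
linearly in $|r|$, whereas the Student-T term tends to $0$ as
$|r| \to \infty$, so far-away observations have a diminishing influence on the
posterior.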






\input{example}


\mypara{\bf Our Work} The goal of our work is to develop a \emph{systematic
understanding} of how various \robts perform in different
scenarios through rigorous empirical evaluation on a broad range of subjects.
We study the impact of \emph{four factors} on the performance (accuracy) and
execution cost: (1) Inference
Algorithms, (2) Noise Models and Noise Levels, (3) Model Class, and (4) User time
budget.%


We present the first extensive study of different \robts on 24 probabilistic
models from three classes: generalized linear models, mixture models, and
time-series models. To help users understand both the practical and fundamental
properties of probabilistic \robts, we developed the \textbf{\NAME} framework.
\NAME{} automatically modifies the program code to apply the robust transformation
(and checks its legality) and systematically evaluates different \robts for
user-defined input noise models and posterior accuracy metrics. \NAME{} then
ranks the transformed programs by predictive accuracy. \NAME{} is
\emph{extensible}: users can easily add new noise models, transformations, \mbox{and
accuracy metrics.}



We implemented three common \emph{noise models} for corrupting the datasets
(Section~\ref{sec:attacks}): (1) {\textit{Simple Outliers}} randomly changes the
value of several data points, (2) {\textit{Introducing Hidden Groups}} corrupts
the data by adding a new distribution mode, and (3) {\textit{Skewing Data}} adds
non-symmetric error to most data points to skew the distribution.
We also implemented five \robts from the literature for each model (we describe them
in Section~\ref{sec:transform}): (1)~{\textit{Bayesian Data
Reweighting}}~\cite{Wang:2017},
(2)~{\textit{Localization}}~\cite{wang2018general},  (3)~{\textit{Robust
Reparameterization}} combines reparameterization from~\cite{stanmanualreparam}
with localization, (4)~{\textit{Student-T}} transformation of Gaussian variables, and
\mbox{(5)~{\textit{Contaminated Group Mixture}}~\cite{berger1994overview}.}

We analyze the posterior predictive accuracy of the \mbox{robustified} models
and their execution times using two \mbox{state-of-the}-art inference
algorithms: {No-U-Turn Sampler (NUTS)} and Automatic Differentiation
Variational Inference (ADVI), implemented in
Stan~\cite{carpenter2016stan}.




\mypara{\bf Results and Insights} Our study yields several interesting insights and
observations:
\begin{itemize}[leftmargin=*,topsep=0pt,noitemsep]
\item Different inference algorithms respond differently to each \robt. For
instance, for the Simple Outliers noise model, Student-T always performs better
than Reparameterization with ADVI, whereas with NUTS, Reparameterization
outperforms Student-T.
\item \Robts can be effective for some noise models, in particular for
Simple Outliers: even when 10\% of the data has been replaced with outliers,
the \robts reduce the error by up to 3x compared to the original program on the
same data. However, most transformations do not generalize well across different
noise models. For instance, all transformations provide very limited benefits
for the Hidden Groups and Skewing Data noise models; this motivates future
research to develop novel \robts for \mbox{these noise models.}
\item \Robts incur greater overheads for NUTS than for ADVI. The run time overheads
(over the original model) for ADVI range between 1.04x and 7.03x, while for NUTS
they are between 1.76x and 14.5x. Hence, some transformations may be impractical
in scenarios with tight time budgets. We present more insights \mbox{in
Section~\ref{sec:eval}.}
\end{itemize}










\mypara{\bf Contributions} This paper makes several contributions:
\begin{itemize}[leftmargin=*,topsep=0pt,noitemsep]
\item {\bf Automated Robustness Evaluation:} We develop \NAME, a novel automated
system that \CComment{allows us to automatically and }efficiently evaluates the
\robts for probabilistic programs.
\item {\bf Systematic Evaluation of Robustness:} We present an extensive study
  of 24 probabilistic programs with multiple \robts, input noise models, and
  inference algorithms. Our results can guide users in selecting a \robt for
  their use cases.
\item {\bf Insights:} We demonstrate that the \robts can
  effectively improve predictive accuracy for some noise models, but
  they may also incur significant execution time overhead. %
  Using \NAME, we obtained insights into how inference algorithms, noise models,
  and model classes interact with the \robts, which are relevant to both users
  and researchers of probabilistic programming.
\end{itemize}

\NAME{} is open sourced at \url{https://github.com/uiuc-arc/astra}.






