
\vspace{-.05in}
\section{Robustness Metrics}
\label{sec:robustness_metrics}
\vspace{-.1in}


\input{tables/transformations}

To evaluate model robustness, we follow the standard methodology of existing
research on robust Bayesian modelling \cite{Wang:2017,wang2018general}: we
inject noise into the observed data and compute the relative change in the
posterior predictive accuracy of the model.



Given a probabilistic program $P$ and an observed dataset $y$ (which we also call the \emph{uncorrupted} dataset), we
fit $P$ to a \emph{corrupted dataset} $y^\textit{Noise}$ that is generated by injecting noise
into $y$. We use the fitted posterior of $P$ to generate predicted
data $\hat{y}$,
and we evaluate the \emph{robustness} of the program using the mean squared error (MSE)
metric:

\vspace{-.25in}
\begin{equation}
\textit{MSE}(\hat{y}, y) = \frac{1}{D} \sum_{i=1}^D (\hat{y}_i -  y_i)^2, 
\end{equation}
\vspace{-.22in}

where $D$ is the size of the dataset. Intuitively, \textit{MSE} quantifies
how much the posterior predictive accuracy degrades
in the presence of data corruption: $\hat{y}$ is predicted by a model fitted to
$y^\textit{Noise}$, but it is compared against the uncorrupted $y$. Computing
\textit{MSE} on predicted data is recommended by \citet{gelman2013bayesian} as a
posterior predictive check for evaluating model fit, and is also used by
\citet{wang2018general}, as the predictive $R^2$ metric, to evaluate model robustness.
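
As a small illustration with made-up values (not taken from our experiments),
let $D = 4$, $y = (1, 2, 3, 4)$, and $\hat{y} = (1.5, 2, 2.5, 5)$. Then
\begin{equation*}
\textit{MSE}(\hat{y}, y) = \frac{1}{4}\left(0.5^2 + 0^2 + (-0.5)^2 + 1^2\right) = \frac{1.5}{4} = 0.375.
\end{equation*}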




Since the value of $\textit{MSE}$ depends on the scale of the data, we
express the $\textit{MSE}$ of each transformed model relative to that of the
original model, following
\citet{saad2019bayesian}. Specifically, let $\textit{MSE}(\hat{y}, y)$ and
$\textit{MSE}(\hat{y}_T, y)$ be the estimated robustness of the original model
$P$ and of a transformed model $P_T$, respectively, where $\hat{y}_T$ denotes
the data predicted by $P_T$. We then define the
\emph{relative improvement of robustness} of the transformed model as:

\vspace{-.25in}
\begin{align}
\textit{RIMSE}(\hat{y}, \hat{y}_T, y) = \textit{MSE}(\hat{y}, y)\,/\,\textit{MSE}(\hat{y}_T, y).
\end{align}
\vspace{-.25in}

Intuitively, \textit{RIMSE} denotes the relative improvement of the
``robustified'' model over the original model.
$\textit{RIMSE} > 1$ indicates improved robustness, $\textit{RIMSE} = 1$
indicates no improvement, and $\textit{RIMSE} < 1$ indicates that the accuracy
of the robustified model is lower than that of the original model. In our example
(Figure~\ref{fig:example}), the best transformation (Reweight) yields a \textit{RIMSE} of
5.22, whereas the least useful transformation (Localization-Location) yields a
\textit{RIMSE} of 1.31.
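
Equivalently, by the definition of \textit{RIMSE}, a value of $r$ means that the
\textit{MSE} of the transformed model is $1/r$ times that of the original model,
i.e., a relative \textit{MSE} reduction of $1 - 1/r$. For the two values above,
this gives approximately
\begin{align*}
1 - 1/5.22 &\approx 0.81 && \text{(Reweight)},\\
1 - 1/1.31 &\approx 0.24 && \text{(Localization-Location)}.
\end{align*}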








\input{transformations}









\section{\NAME}\label{sec:astra} 
At a high level, \NAME takes a probabilistic program $P$, a
dataset $y$,
a noise model $A$, an inference algorithm $I$,
and a set of transformations $T$ to apply to $P$. \NAME first generates the
transformed programs by applying each transformation in $T$ to $P$. \NAME then
compares each transformed program against the original program and returns the
list of transformed programs and their corresponding robustness \emph{scores},
sorted in decreasing \mbox{order of their robustness.}

\subsection{Probabilistic Program Transformations}
\begin{wrapfigure}{r}{3cm}
\vspace{-.25in}
\footnotesize
\begin{lstlisting}[language=prob,basicstyle={\scriptsize\ttfamily},numbers=left,
framexleftmargin=-5pt,xleftmargin=0pt,escapechar=!]
data {
 int<lower=0> N; 
 vector[N] y;
}
parameters {
 real b;
} 
model {
 for (i in 1:N)
  y[i]~normal(b,1);
}
\end{lstlisting}
\vspace{-.2in}
\caption{Example PP}
\label{fig:example_program}
\vspace{-.08in}
\end{wrapfigure}


\mypara{Probabilistic Programs}
\NAME takes as input a probabilistic program (PP), i.e., a probabilistic model
encoded as a program, written in the Stan probabilistic programming
language~\cite{carpenter2016stan}.
Figure~\ref{fig:example_program} shows the Stan program for the original model in the motivating example (Figure~\ref{fig:example}).
The representation is intuitive: the \texttt{data} block declares the number of observations \texttt{N} and the observed data \texttt{y}; the \texttt{parameters} block declares
the single model parameter \texttt{b}; and the \texttt{model} block states that each observation \texttt{y[i]} follows a normal distribution with mean \texttt{b} and standard deviation 1.
Given such a probabilistic program, Stan can
automatically apply inference algorithms such as MCMC or VI to compute the posterior distribution of the parameters.



\mypara{Transformations}
To allow automated transformations of the probabilistic program, we use
Storm-IR~\cite{dutta2019storm} as our internal representation.
Storm-IR represents program constructs such as sampling from
distributions (Dist) and conditioning on data (factor) as a graph with program
elements as nodes and control flow as edges (similar to a compiler
CFG~\cite{allen1970control}).
Since Storm-IR supports multiple languages (e.g., Stan, Pyro, Edward), it allows \NAME to be language-agnostic.
\NAME first parses the original probabilistic program into an abstract syntax tree
and converts it to Storm-IR. On this IR, searching for a code pattern from
Table~\ref{tab:transformations} amounts to searching for a subgraph that encodes
the pattern (e.g., statements corresponding to $\beta \sim
\pi_{\beta}(\alpha)$ and $y_{i=1}^{D}\sim F(\beta)$, which need not be
adjacent), while recording the concrete variable names (e.g., $\beta \mapsto$
\texttt{b}, $y \mapsto$ \texttt{y}) and distributions (e.g., $F \mapsto
\mathcal{N}(\texttt{b}, \texttt{1})$). \NAME
then uses the identified
distributions and variables to instantiate the transformation template and update the program.
For example, to apply the Normal-to-Student-T transformation to Figure~\ref{fig:example_program},
\NAME{} replaces the normal distribution on Line~10 with a Student-T
distribution, \texttt{student\_t(nu,b,1)}, where \texttt{nu} is a new parameter
for the degrees of freedom. \NAME also places a uniform prior on \texttt{nu}.
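
The Normal-to-Student-T example can be imitated at the text level as follows.
This is only a sketch under stated assumptions: \NAME{} matches subgraphs of
Storm-IR rather than strings; the helper \texttt{normal\_to\_student\_t} and the
bounds \texttt{nu\_lo} and \texttt{nu\_hi} of the uniform prior are hypothetical;
and the regular expression assumes that the only \texttt{normal} statement in
the program is the likelihood.

\begin{lstlisting}[language=Python,basicstyle={\scriptsize\ttfamily},numbers=none,breaklines=true]
import re

# Stan source of the example PP.
STAN_SRC = """\
data {
 int<lower=0> N;
 vector[N] y;
}
parameters {
 real b;
}
model {
 for (i in 1:N)
  y[i]~normal(b,1);
}
"""

def normal_to_student_t(src, nu_lo=1, nu_hi=30):
    """Text-level stand-in for Normal-to-Student-T.

    nu_lo and nu_hi are illustrative bounds for the
    uniform prior on nu (an assumption of this sketch).
    """
    # 1. Rewrite `<lhs> ~ normal(<loc>,<scale>);` into a
    #    Student-T statement with the new parameter nu.
    pattern = r"(\S+)\s*~\s*normal\(([^,]+),([^)]+)\);"
    repl = r"\1~student_t(nu,\2,\3);"
    src, n_matches = re.subn(pattern, repl, src)
    if n_matches == 0:
        return src  # pattern absent: nothing to transform
    # 2. Declare nu in the parameters block.
    decl = f" real<lower={nu_lo},upper={nu_hi}> nu;\n"
    src = src.replace("parameters {\n",
                      "parameters {\n" + decl, 1)
    # 3. Put a uniform prior on nu in the model block.
    prior = f" nu~uniform({nu_lo},{nu_hi});\n"
    src = src.replace("model {\n", "model {\n" + prior, 1)
    return src

transformed = normal_to_student_t(STAN_SRC)
print(transformed)
\end{lstlisting}

The printed program keeps the \texttt{data} block unchanged, declares
\texttt{nu} next to \texttt{b}, and conditions \texttt{y[i]} on
\texttt{student\_t(nu,b,1)}.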

We show the details of Storm-IR syntax in Appendix A, the
code transformation patterns (on Storm-IR) in Appendix B, and
the proof of correctness (in the sense of code transformations matching the models
from \mbox{Table~\ref{tab:transformations}) in Appendix C.}


\subsection{\NAME Algorithm}

\begin{figure}[b!] %
    \vspace{-0.3in}
    \begin{minipage}{0.48\textwidth}
    \begin{algorithm}[H]
      \caption{\NAME{} Algorithm}
    \label{algo:main-algo}
    \algosize{}  
        \vspace{-.03in}
   \begin{flushleft}    
  \textbf{Input}: Program  $\textit{P}$,  Data $y$,
      Noise Model $A$, Inference Algo $I$,\\
  \hspace{2em} Transformations $T$\\
      \textbf{Output}: \mbox{Transformed Programs Ranked by Robustness}
   \end{flushleft}
        \vspace{-.1in}
    \begin{algorithmic}[1]
        \STATE {\bfseries procedure} {\NAME}{($P$, $y$, $A$, $I$, $T$)}
        \begin{ALC@g}
        \STATE $\textit{Results} \gets \emptyset$    \label{algo:line:resultinit}
        \STATE $\bm{\textit{P}_T} \gets \textit{ApplyTransforms}(P, T)$  \label{algo:line:trans}
        \FOR{$P_T \in \bm{\textit{P}_T}$}  \label{algo:line:progloopstart}
        \STATE $\textit{Score} \gets \emptyset$        
        \FOR{$i \gets 1$ \text{to} \texttt{N}}     \label{algo:line:repeatloopstart}
        \STATE $y^\textit{Noise} \gets A(y)$   \label{algo:line:attack}
        \STATE $\hat{y}  \gets \textit{Infer}(P, y^\textit{Noise}, I)$ \label{algo:line:inferorig}
        \STATE $\hat{y}_T \gets \textit{Infer}(P_{T}, y^\textit{Noise}, I)$     \label{algo:line:infert}
        \STATE $\textit{Score} \gets \textit{Score} \cup
          \{\textit{RIMSE}(\hat{y}, \hat{y}_T, y)\}$ \label{algo:line:score}
        \ENDFOR \label{algo:line:repeatloopend}
        \STATE $\textit{Results} \gets \textit{Results} \cup \{(P_T,
          \textit{Avg}(\textit{Score}))\}$  \label{algo:line:results}
        \ENDFOR                   \label{algo:line:progloopend}
        \end{ALC@g}
        \STATE {\bfseries return} $\textit{Sort}(\textit{Results})$   \label{algo:line:return}
    \end{algorithmic}
    \end{algorithm}
    \end{minipage}
  \end{figure}
  

Algorithm~\ref{algo:main-algo} presents \NAME's main algorithm.
First, \NAME initializes a set, $\textit{Results}$, for storing the robustness
scores of all transformed programs (L.2)
and generates the transformed programs $\bm{P_T}$ (L.3).
Next, \NAME evaluates the robustness of each transformed program (L.4--13).
For each transformed program $P_T \in \bm{P_T}$, \NAME performs the following
steps \texttt{N} times. It first generates a noisy dataset, $y^\textit{Noise}$,
using the specified noise model $\textit{A}$ (L.7).
It then runs the user-selected inference algorithm $I$ on the original program
$P$ with the noisy dataset $y^\textit{Noise}$ to estimate the latent parameters
and, from them, the posterior data predictions $\hat{y}$ (L.8).
The \textit{Infer} method encapsulates this step, together with other
user-specified inference settings such as the number of samples (for MCMC) or
the number of iterations (for VI). \NAME then
runs inference for the transformed program $P_T$ on the same noisy dataset
to obtain $\hat{y}_T$ (L.9),
and computes the robustness score \mbox{using the $\textit{RIMSE}$ metric (L.10).}


\NAME computes the average score (e.g., the arithmetic or geometric mean) for the
transformed program $P_T$ and adds the result to the $\textit{Results}$ set
(L.12).
Averaging the scores over multiple runs (and hence over different noisy datasets) produces
a more reliable estimate of the robustness of a transformation.
Finally, \NAME returns the list of transformed programs in descending order of
their robustness scores (L.14).
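
The loop of Algorithm~\ref{algo:main-algo} can be sketched in Python as follows.
This is a minimal illustration, not \NAME's implementation: the names
\texttt{rank}, \texttt{outlier\_noise}, and \texttt{infer} are hypothetical, a
``program'' is reduced to a location estimator, and replacing the mean by the
median merely stands in for a robustifying transformation.

\begin{lstlisting}[language=Python,basicstyle={\scriptsize\ttfamily},numbers=none,breaklines=true]
import random
import statistics

def mse(y_hat, y):
    """Mean squared error between predictions and data."""
    sq = [(p - t) ** 2 for p, t in zip(y_hat, y)]
    return sum(sq) / len(y)

def rimse(y_hat, y_hat_t, y):
    """MSE of the original over MSE of the transformed model."""
    return mse(y_hat, y) / mse(y_hat_t, y)

def rank(program, y, noise, infer, transforms,
         n_runs=5, avg=statistics.mean):
    results = []
    for name, transform in transforms.items():
        program_t = transform(program)          # L.3
        scores = []                             # L.5
        for _ in range(n_runs):                 # L.6
            y_noise = noise(y)                  # L.7
            y_hat = infer(program, y_noise)     # L.8
            y_hat_t = infer(program_t, y_noise) # L.9
            # L.10: score against the uncorrupted y
            scores.append(rimse(y_hat, y_hat_t, y))
        results.append((name, avg(scores)))     # L.12
    # L.14: most robust transformation first
    return sorted(results, key=lambda r: r[1],
                  reverse=True)

# Toy stand-ins (assumptions, not ASTRA components).
rng = random.Random(0)

def outlier_noise(y, k=2, value=100.0):
    """Replace k randomly chosen observations by an outlier."""
    y_noise = list(y)
    for i in rng.sample(range(len(y)), k):
        y_noise[i] = value
    return y_noise

def infer(program, data):
    """Predict every observation by the fitted location."""
    return [program(data)] * len(data)

y = [float(v) for v in range(1, 11)]
transforms = {
    "identity": lambda p: p,
    "median": lambda p: statistics.median,
}
ranking = rank(statistics.mean, y, outlier_noise,
               infer, transforms)
print(ranking)
\end{lstlisting}

In this toy setting the identity transformation scores exactly 1, and the
median-based variant ranks first because it is less affected by the injected
outliers. Passing \texttt{avg=statistics.geometric\_mean} switches the
aggregation to the geometric mean.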



{\NAME also supports other user-specified robustness metrics, each of which
    can be specified as a simple function using our Python interface.}
Further, unlike the approach of \citet{wang2018general}, which uses only synthetic
data (simulated from the original model with known parameters) as $y$, \NAME{} allows users to provide
their own uncorrupted data as $y$ when the true data model is unknown.
Given the uncorrupted data $y$, \NAME{} helps users compare
how well different models fit~$y$.









