

\section{\Robts}
\label{sec:transform}
We describe various \robts for probabilistic models from the literature
that we use in our study.

\mypara{Bayesian Data Reweighting}
This \tran changes the contribution of each data sample (observation) by raising
its likelihood term in the model to the power of a per-sample weight~\citep{Wang:2017}. The
weights are treated as latent variables and inferred along with the rest of
the model's parameters. During inference, outliers are automatically
assigned lower weights, which reduces their influence on the other parameters and improves prediction.
Table~\ref{tab:transformations}(a) presents an example of an original model and
its transformed version. The transformation introduces a vector of weights $w$ with a Beta prior. The notation $y_{i}\sim F(\beta)^{w_i}$ denotes that the
likelihood $F$ for each data sample \mbox{is raised to the power of its weight.}
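
As a sketch of what this notation means, the posterior of the reweighted model can be written as follows. The symbols are our assumptions for illustration: $f(\cdot \mid \beta)$ is the density of $F(\beta)$, $\pi_{\beta}$ is the prior on $\beta$, $\pi_{w}$ is the Beta prior applied independently to each weight, and $D$ is the number of data points.
\[
  p(\beta, w \mid y) \;\propto\; \pi_{\beta}(\beta) \prod_{i=1}^{D} \pi_{w}(w_i)\, f(y_i \mid \beta)^{w_i}
\]
Under this reading, setting every $w_i = 1$ recovers the likelihood of the original model, while $w_i \to 0$ removes the contribution of $y_i$ entirely.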

 







\mypara{Localization}
This \tran allows each likelihood term to depend on its own copy of the latent
variable~\citep{wang2018general}. Table~\ref{tab:transformations}(b) presents an
example of an original model and its robustified version. In the
transformed model, there are $D$ local versions of the latent variable,
$\eta_i$, one for each data point $y_i$. All the auxiliary local variables are
sampled from the prior $\pi_{\eta}$. We use a
Gaussian prior for $\eta_i$ in our evaluation, following the examples in the original work
\cite{wang2018general}.
Unlike \cite{wang2018general}, which designs a specialized E-M algorithm to fit the parameter $s$ of the Gaussian prior, we infer $s$ jointly with the other parameters via Bayesian inference.
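
The generative structure described above can be sketched in symbols. Writing the dependence of the prior on $s$ explicitly as $\pi_{\eta}(\cdot \mid s)$, and using $f(\cdot \mid \eta)$ for the density of a likelihood $F(\eta)$, are our notational assumptions for illustration; the exact model is the one in Table~\ref{tab:transformations}(b).
\[
  \eta_i \sim \pi_{\eta}(\cdot \mid s), \qquad y_i \mid \eta_i \sim F(\eta_i), \qquad i = 1, \dots, D
\]
Each data point then has the marginal likelihood
\[
  p(y_i \mid s) \;=\; \int f(y_i \mid \eta_i)\, \pi_{\eta}(\eta_i \mid s)\, d\eta_i ,
\]
that is, a mixture of the original likelihood over the local variable $\eta_i$.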















\mypara{Normal to Student-T} The Normal distribution is not robust to outliers or
over-dispersed data. A simple alternative is the Student-T
distribution~\cite{berger1994overview}. Intuitively, the heavier tails of the
Student-T distribution can better capture data points that lie far from the majority. In this
\tran, \robustify{} replaces a Normal distribution with a Student-T distribution, preserving
the location and scale parameters in the program and adding a new parameter
$\nu$ for the degrees of freedom (DOF).
Table~\ref{tab:transformations}(c) presents the transformation.
In the transformed model, $\nu$ is drawn from the prior $\pi_{\nu}$. Since we may not have prior knowledge about $\nu$, we use
a uniform (non-informative) prior for it.
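
To make the tail behavior concrete, recall the standard density of the Student-T distribution. Here $\mu$ and $\sigma$ are our names (not taken from the table) for the location and scale parameters preserved from the Normal distribution.
\[
  p(y \mid \nu, \mu, \sigma) \;=\;
  \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)\sqrt{\nu\pi}\,\sigma}
  \left(1 + \frac{1}{\nu}\left(\frac{y-\mu}{\sigma}\right)^{2}\right)^{-\frac{\nu+1}{2}}
\]
For fixed $\nu$, this density decays polynomially, on the order of $|y|^{-(\nu+1)}$, whereas the Normal density decays as $\exp\!\left(-(y-\mu)^2/(2\sigma^2)\right)$. As $\nu \to \infty$, the Student-T distribution converges to the Normal distribution with the same $\mu$ and $\sigma$.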


\mypara{Reparameterization and Localization of the Scale Parameter}
\label{sec:reparam}
This \tran replaces the Gaussian likelihood with an equivalent formulation of the
Student-T distribution and also localizes the additional parameter $\tau$.
Table~\ref{tab:transformations}(d) presents an example. The transformation adds
$D$ parameters $\tau_i$ that adjust the standard deviation of the likelihood for
each data point, which has an effect similar to that of the Localization transformation.
Each $\tau_i$ is drawn from a \textit{Gamma} prior with hyper-parameter $\nu$. If we integrate out all the $\tau_i$, $\nu$ is equivalent to
the DOF parameter in the Normal to Student-T transformation
\cite{stanmanualreparam} (but this form can be more amenable to sampling with MCMC
algorithms).
This \tran is applicable only to Normal distributions.
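
The equivalence can be sketched with the standard scale-mixture identity. The parameterization below is an assumption for illustration and may differ from the one in Table~\ref{tab:transformations}(d): the Gamma distribution uses shape and rate, the second argument of $\mathcal{N}$ is a standard deviation, and $\mu$ and $\sigma$ are our names for the location and scale of the original Normal likelihood.
\[
  \tau_i \sim \mathrm{Gamma}\!\left(\frac{\nu}{2}, \frac{\nu}{2}\right), \qquad
  y_i \mid \tau_i \sim \mathcal{N}\!\left(\mu, \frac{\sigma}{\sqrt{\tau_i}}\right)
\]
Integrating out $\tau_i$ under these assumptions gives
\[
  \int_{0}^{\infty} \mathcal{N}\!\left(y_i \,\Big|\, \mu, \frac{\sigma}{\sqrt{\tau_i}}\right)
  \mathrm{Gamma}\!\left(\tau_i \,\Big|\, \frac{\nu}{2}, \frac{\nu}{2}\right) d\tau_i
  \;=\; \mathrm{StudentT}(y_i \mid \nu, \mu, \sigma),
\]
so a small $\tau_i$ inflates the standard deviation for the $i$-th data point only.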







\mypara{Contaminated Group Mixture} To let the model capture a small amount
of corruption in the data, we can encode in the model that the data come from a
mixture of the original model and an outlier group~\citep{berger1994overview}.
Table~\ref{tab:transformations}(e) presents an example. With
probability $1-\rho_{\textit{out}}$, a data point comes from the original model;
with probability $\rho_{\textit{out}}$, it comes from another distribution with
a different (likely larger) variance. Because the outlier group can absorb the
contaminated data, these data do not directly affect the original model's parameters. Both $\rho_{\textit{out}}$ and the scale of the new
group are latent parameters that can adapt to the user's data. To ensure a positive
scale parameter, we set the outlier group's scale parameter to $\sqrt{e^\nu}$, where $\nu$ is another hyper-parameter.
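
As a sketch, the resulting per-point likelihood is a two-component mixture. For illustration only, we assume that the original likelihood is Normal with location $\mu$ and scale $\sigma$ (our names), that the outlier group is Normal with the same location, and that the second argument of $\mathcal{N}$ is a standard deviation; the exact form is the one in Table~\ref{tab:transformations}(e).
\[
  p(y_i \mid \mu, \sigma, \rho_{\textit{out}}, \nu) \;=\;
  (1-\rho_{\textit{out}})\, \mathcal{N}(y_i \mid \mu, \sigma)
  \;+\; \rho_{\textit{out}}\, \mathcal{N}\!\left(y_i \mid \mu, \sqrt{e^{\nu}}\right)
\]
Since $\sqrt{e^{\nu}} = e^{\nu/2} > 0$ for every real $\nu$, the outlier scale stays positive without constraining $\nu$.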






























