\section{Case Studies}
\label{sec:casestudies}
\subsection{Case Study 1: Generalized Linear Regression Model}

\subsubsection{Linear Regression}


We study a simple linear regression model $y_{i}\mid x_{i}\sim\mathcal{N}(w_{1}x_{1i}+w_{2}x_{2i}+w_{3}x_{3i}+w_{4}x_{4i}+w_{5}x_{5i}+b,\sigma^{2})$ for $i=1,\ldots,D$.
Here the $y_i$ and $x_i$ form the given dataset, and
 $w_{j}$ ($j\in\{1,\ldots,5\}$), $b$, and $\sigma$ are latent parameters.
We apply the transformations
StudentT, Reparam, Reweight, and Mixture to the original model.

To evaluate the robustness of the original and transformed models,
we first draw the ground-truth parameters
$w_{j}$ and $b$ from $\mathcal{N}(0,1)$ and the inputs $x_{i}$
from $\textit{Unif}(-1,1)$.
We then use these parameters to generate 500 training data points with noise (Outliers).
To obtain a test dataset, we generate another 500 samples in the same way but add no noise.

Table~\ref{tab:study_linear_nuts} shows the results of evaluating the
original and transformed models using different robustness metrics.
We repeat the procedure 5 times and report the average of the metric
values. We present the results for two noise levels, $k=0$ and $k=10$,
where $k=0$ means that there is no noise in the data $\boldsymbol{y}$.
We fit all the models using Stan's NUTS, running
4 chains with 1000 samples each. We take the mean of the samples to
obtain a point estimate of each parameter and of the predicted data.

In Table~\ref{tab:study_linear_nuts}, when the noise level is $k=0$,
all the metrics are almost at their best values, i.e., the
models are accurate. Specifically, all the models have $\text{MSE}_{\text{param}}=0.002$
(with $<0.0005$ rounding error) when $k=0$. Intuitively, this means that
on average each of the parameters $\beta_{j}\in\{w_{1},w_{2},w_{3},w_{4},w_{5},\ b,\ \sigma\}$
has squared error $(\hat{\beta}_{j}-\beta_{j})^{2}\approx0.002$,
so $\hat{\beta}_{j}$, the posterior sample mean after
fitting the model, is almost identical to $\beta_{j}$, the
true parameter value used to generate the data. At noise level
$k=10$, some of the metrics move away from their optimal values, which indicates
that the models are less accurate when noise is present in the
training data. For example, when $k=10$, for the original model,
$\text{MSE}_{\text{param}}$ increases from $0.002$ to $0.257$, which
intuitively means that on average each of the parameters $\beta_{j}$
has squared error $(\hat{\beta}_{j}-\beta_{j})^{2}\approx0.257$.
In contrast, for the Student-T transformed model, on average each
$\beta_{j}$ has $(\hat{\beta}_{j}-\beta_{j})^{2}\approx0.012$,
much smaller than for the original model.
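
Assuming $\text{MSE}_{\text{param}}$ is the unweighted mean of the squared errors over the seven parameters (an assumed definition, chosen to be consistent with the per-parameter reading above), it can be written as
\[
\text{MSE}_{\text{param}}=\frac{1}{7}\sum_{j=1}^{7}(\hat{\beta}_{j}-\beta_{j})^{2}.
\]
Under this reading, the root-mean-square error per parameter at $k=10$ is roughly $\sqrt{0.257}\approx0.51$ for the original model and $\sqrt{0.012}\approx0.11$ for the Student-T transformed model.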



All the robustness metrics indicate that the models after StudentT
or Reparam are the most robust, followed by Reweight, and that the
Original model is the least robust.
Take the parameter $w_{1}$, which has the true value $1.012$,
as an example: the Student model gives the
mean value $1.010$, the Reparam model gives $1.021$, the Reweight
model gives $0.999$, and the Original model gives $0.965$.

The reason that Reparam and Student are better than Reweight may be
that they have one more layer of hierarchy than Reweight. All
three transformations add one auxiliary parameter for every data point.
Reparam and Student have one additional hyperparameter representing
the degrees of freedom used in the hyperprior of the auxiliary parameters,
while Reweight uses a flat hyperprior with no additional parameter.

If we use a Beta hyperprior with a hyperparameter for Reweight's
auxiliary parameters, i.e., $\textit{weight}\sim\textit{Beta}(\alpha,\alpha)$ and
$\text{factor}(\textit{weight}[i]\cdot d(E_{1},\ldots).\text{pdf}(y[i]))$,
Reweight might give a better result than both Reparam and Student
(see the $\text{Reweight}^*$ rows in Table~\ref{tab:study_linear_nuts}).
However, this drastically increases the inference time and makes the
sampler more likely to diverge.
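
As a sketch of how the extra hyperparameter generalizes the flat case, and assuming Reweight's flat hyperprior is uniform on $[0,1]$ (an assumption; its support is not stated here), the symmetric Beta hyperprior satisfies
\[
\textit{weight}[i]\sim\textit{Beta}(\alpha,\alpha),\qquad\mathrm{E}[\textit{weight}[i]]=\tfrac{1}{2},\qquad\mathrm{Var}[\textit{weight}[i]]=\frac{1}{4(2\alpha+1)},
\]
with $\textit{Beta}(1,1)=\textit{Unif}(0,1)$ as the special case $\alpha=1$. Values $\alpha<1$ put more mass near $0$ and $1$, while values $\alpha>1$ concentrate the weights near $\tfrac{1}{2}$.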

\begin{table*}[!htp]\centering
\caption{Evaluation Results for the Linear Regression Model (NUTS)}\label{tab:study_linear_nuts}
\footnotesize
\begin{tabular}{llrrrrrrrr}\toprule
Transform &(\#params) & $k$ & Time (s) &$\text{MSE}_\text{param}$ &$\text{MSE}_y$ &$\hat{R}$ &$\hat{R}_\text{max}$ &pL1 &pR2 \\\midrule

Original& (7) &0  & 5.12 &0.002 &0.000 &1.000 &1.004 &1.00 &1.00 \\
        & &10 & 4.11 &0.257 &0.026 &1.000 &1.003 &0.97 &1.00 \\\midrule
Reparam &(508)  &0  & 214.36 &0.002 &0.000 &1.000 &1.004 &1.00 &1.00 \\
        & &10 & 93.44 &0.013 &0.004 &1.000 &1.014 &0.99 &1.00 \\\midrule
Reweight &(507) &0  & 115.35 &0.002 &0.000 &1.000 &1.004 &1.00 &1.00 \\
        & &10 & 42.42 &0.087 &0.010 &1.000 &1.003 &0.98 &1.00 \\\midrule
$\text{Reweight}^*$ &(508) &0  & 1003.16 &0.002 &  &1.05 &1.12 &  &  \\
                    &      &10 & 1014.40 &0.006 &  &1.74 &3.53 &  &  \\\midrule
Student &(8) &0  & 36.02 &0.002 &0.000 &1.000 &1.003 &1.00 &1.00 \\
        & &10 & 11.06 &0.012 &0.004 &1.000 &1.003 &0.99 &1.00 \\

\bottomrule
\end{tabular}
\end{table*}

\paragraph*{Timing}

Interestingly, the models may run faster on data with noise than on
clean data. This is probably because the posterior is easier
to sample from when conditioned on noisy data, although this may not hold
for other models. For the parameter $w_{1}$, the posterior has an
average standard deviation of $1\cdot10^{-2}$ with noisy data, whereas
with clean data the standard deviation is only $3\cdot10^{-4}$.
Intuitively, with the same prior distribution, when the posterior
from noisy data is more spread out, a random sample proposed by MCMC
is more likely to be accepted.

Different prior distributions also affect the timing. For example,
we run the Original model on the same clean data with different priors.
For the parameter $\sigma$, if we specify the limit \texttt{<lower=0>} and
do not specify any other prior, sampling takes 7.5s; if we leave out the
limit and allow Stan to reject illegal samples, it is faster, taking
$5.12$s as shown in Table~\ref{tab:study_linear_nuts}. This may be because the limit
introduces a discontinuity in the distribution and makes the sampling
harder. For the other parameters, let $w_{i}^{*}$ be the true value
of $w_{i}$. If we use the prior $w_{i}\sim(w_{i}^{*},10)$, sampling
takes 4.96s, slightly faster than the default; with $w_{i}\sim(w_{i}^{*},100)$,
it takes 5.36s, slower than the default.

\subsubsection{Poisson Regression}

The Poisson regression model is $y_{i}\mid x_{i}\sim\textit{Poisson}(\exp(w_{1}x_{1i}+w_{2}x_{2i}+w_{3}x_{3i}+w_{4}x_{4i}+w_{5}x_{5i}))$,
and the noise model is $y_{i}\mid x_{i}\sim\textit{Poisson}(\exp(w_{1}x_{1i}+w_{2}x_{2i}+w_{3}x_{3i}+w_{4}x_{4i}+w_{5}x_{5i}+\epsilon_{i}))$,
where $\epsilon_{i}\sim\mathcal{N}(0,k\cdot0.15)$. Unlike
the other models, the Poisson regression model is harder to bring to convergence.
In Table~\ref{tab:study_poisson_nuts}, we present the results
of running NUTS with 4 chains, each with 10000 iterations. If we use
1000 iterations, most of the transformed models do not converge.
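
The noise model can be read as a multiplicative perturbation of the Poisson rate. Writing $\lambda_{i}$ for the clean rate (notation introduced here for illustration only),
\begin{align*}
\lambda_{i} & =\exp(w_{1}x_{1i}+w_{2}x_{2i}+w_{3}x_{3i}+w_{4}x_{4i}+w_{5}x_{5i}),\\
y_{i}\mid x_{i},\epsilon_{i} & \sim\textit{Poisson}(\lambda_{i}e^{\epsilon_{i}}).
\end{align*}
Because $\epsilon_{i}$ is Gaussian with mean zero, $e^{\epsilon_{i}}$ is log-normal with $\mathrm{E}[e^{\epsilon_{i}}]=\exp(s^{2}/2)$, where $s^{2}$ denotes the variance of $\epsilon_{i}$ (whether $k\cdot0.15$ is the variance or the standard deviation is not specified here). The noise therefore inflates the expected counts for any $k>0$, and the heavy right tail of $e^{\epsilon_{i}}$ can produce very large values of $y_{i}$.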

\begin{table*}[!htp]\centering
\caption{Evaluation Results for the Poisson Regression Model (NUTS)}\label{tab:study_poisson_nuts}
\footnotesize
\begin{tabular}{llrrrrrrrr}\toprule
Transform &(\#params) & $k$ & Time (s) &$\text{MSE}_\text{param}$ &$\text{MSE}_y$ &$\hat{R}$ &$\hat{R}_\text{max}$ &pL1 &pR2 \\\midrule

Original &(5) &0 &17.052 &0.000 & $1.65\cdot 10^5$ &1.000 &1.001 &0.997 &1.000 \\
& &10 &17.248 &0.057 & $6.58\cdot 10^{10}$ &1.000 &1.000 &-0.487 &-3.307 \\\midrule
Reweight &(505) &0 &146.864 &0.000 & $2.48\cdot 10^{4}$ &1.000 &1.000 &0.998 &1.000 \\
& &10 &136.662 &0.006 & $4.65\cdot 10^{9}$ &1.000 &1.011 &0.561 &0.695 \\

\bottomrule
\end{tabular}
\end{table*}

From the results, we can see that the Reweight model works best: it
gives a small MSE for the parameters even when the data contain
noise. Specifically, in one run, the true value of $w_{1}$ is $1.31$.
With clean data, both the original model and the Reweight transformed
model recover the correct value. With noise level $k=10$,
the Reweight model gives $1.28$, while the original model gives $1.21$.


\subsubsection{Logistic Regression}

For the logistic regression model, we sort the data points by their probability
and then flip the labels, starting from the ones with the lowest scores,
as in \cite{wang2018general}. The
Reweight model gives better results than the Original one,
but the difference is not large, because the label-flipping noise
model has little effect on either model.

\subsection{Case Study 2: Mixture Model}

\subsubsection{gauss\_mix\_given\_theta}

The noise model for the mixture model adds one more noise
group with mean $\max(\mu_{1},\mu_{2})+|\mu_{1}-\mu_{2}|$, meaning
that the new group lies to the right of the two original groups, so that the three
groups are equally spaced. The new group has probability $p_{3}=0.02k$,
and the original two groups have probabilities $p_{1}=(1-p_{3})\cdot\theta$
and $p_{2}=(1-p_{3})\cdot(1-\theta)=1-p_{3}-p_{1}$. In the results,
Reparam and Student rank first, followed by Original, which is similar
to the linear regression model. However, the Reweight model does not
work: it misplaces two group means, putting them between the noise
group and the group to the right. The other transformations, Reparam
and Student, are able to filter out the noise group and identify the
true groups.
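
As a worked example with hypothetical values (not taken from the experiments), let $\mu_{1}=0$, $\mu_{2}=2$, $\theta=0.4$, and $k=10$, and write $\mu_{3}$ for the mean of the noise group. Then
\begin{align*}
\mu_{3} & =\max(0,2)+|0-2|=4,\\
p_{3} & =0.02\cdot10=0.2,\\
p_{1} & =(1-0.2)\cdot0.4=0.32,\\
p_{2} & =(1-0.2)\cdot(1-0.4)=0.48,
\end{align*}
so the three group means $0$, $2$, and $4$ are equally spaced and $p_{1}+p_{2}+p_{3}=1$.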


\subsection{Case Study 3: Time Series Models}

\subsubsection{koyck}

The koyck model is $y_{t}\sim\mathcal{N}(w_{1}+w_{2}x_{t}+w_{3}y_{t-1},\sigma)$.
We use the Outliers noise model. Note that the form of this model is similar
to that of the linear regression model. The difference is the additional
dependency on the previous time step. The results are also similar:
Student and Reparam are the best, followed by Reweight. All three transformations
are more robust than Original.
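
To make the similarity explicit, the model can be rewritten as a linear regression on an augmented covariate vector $z_{t}$ (notation introduced here for illustration only):
\[
y_{t}\sim\mathcal{N}(\boldsymbol{w}^{\top}z_{t},\sigma),\qquad\boldsymbol{w}=(w_{1},w_{2},w_{3})^{\top},\qquad z_{t}=(1,\ x_{t},\ y_{t-1})^{\top}.
\]
The structural difference from the linear regression case is that the last component of $z_{t}$ is itself an observed response, so an outlier in $y_{t-1}$ enters the model both as a response and as a covariate for the next time step.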

\subsubsection{gp-fit-latent}
The gp-fit-latent model is 
\begin{align*}
\rho & \sim\textit{InvGamma}(5,5)\\
\alpha & \sim\mathcal{N}(0,1)\\
\sigma & \sim\mathcal{N}(0,1)\\
f & \sim\textit{MultivariateNormal}(0,K(x|\alpha,\rho))\\
y_{i} & \sim\mathcal{N}(f_{i},\sigma)\ \ \forall i\in\{1,\ldots,N\}
\end{align*}
Here $K$ is an exponentiated quadratic kernel.
Under the Outliers noise model, Reparam and Student are the best, followed by Reweight.
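
For reference, the exponentiated quadratic kernel is commonly parameterized as below for scalar inputs, with $\alpha$ the marginal standard deviation and $\rho$ the length scale. This is the standard form (it matches Stan's built-in exponentiated quadratic covariance function) and is assumed rather than confirmed for this model:
\[
K(x\mid\alpha,\rho)_{ij}=\alpha^{2}\exp\left(-\frac{(x_{i}-x_{j})^{2}}{2\rho^{2}}\right).
\]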






