\section{Statistical test ablation}\label{appendix:ablation-statistical-test}
Algorithm~\ref{alg:converged} employs a statistical test to decide whether to continue or stop training.
Specifically, training continues as long as the mean of the log-weights used for training and the mean of a new set of log-weights (each an estimate of the ELBO) are statistically different.
An alternative approach would be to compare the distributions of both training and testing log-weights using tests designed for this task.
To examine this, we conducted experiments similar to those in Section~\ref{sec:experiments}, comparing distributions with the \emph{two-sample Kolmogorov--Smirnov test} (KS-test) and the \emph{two-sample Cram\'er--von Mises test} (CvM-test).
The findings are detailed in Table~\ref{tab:ablation-statistical-test}.
In most cases, the outcomes closely resemble those achieved with the t-test; the largest gaps occur on \texttt{sonar} and \texttt{election88Exp}.
Generally, the algorithm runs slightly longer with the t-test than with the KS-test or the CvM-test.
We attribute this later stopping to the greater statistical power gained from comparing means rather than entire distributions.
Among the distribution tests, the CvM-test yields marginally better results than the KS-test, which we attribute to the CvM-test's higher statistical power \citep{stephens1974edf}.
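
To make the three criteria concrete, the test statistics can be written out as follows. The notation here is introduced only for illustration and is not taken from Algorithm~\ref{alg:converged}: $x_1,\dots,x_n$ denote the training log-weights, $y_1,\dots,y_m$ the new log-weights, $\bar{x},\bar{y}$ and $s_x^2,s_y^2$ their sample means and variances, and $F_n$, $G_m$ their empirical distribution functions. We also assume the unequal-variance (Welch) form of the t-statistic; the pooled-variance form would serve equally well for this illustration.
\[
  t = \frac{\bar{x}-\bar{y}}{\sqrt{s_x^2/n + s_y^2/m}},
\]
\[
  D_{n,m} = \sup_{u} \left| F_n(u) - G_m(u) \right|,
\]
\[
  T_{n,m} = \frac{nm}{(n+m)^2} \left[ \sum_{i=1}^{n} \bigl(F_n(x_i)-G_m(x_i)\bigr)^2 + \sum_{j=1}^{m} \bigl(F_n(y_j)-G_m(y_j)\bigr)^2 \right].
\]
The t-statistic depends on the two samples only through their means and variances, so it reacts to a shift in the ELBO estimate and to nothing else. The KS statistic $D_{n,m}$ records the single largest gap between the two empirical distribution functions, whereas the CvM statistic $T_{n,m}$ accumulates the squared gaps over the pooled sample.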



\begin{table}[t!]
  \renewcommand{\arraystretch}{1.2}
  \begin{center}
  \begin{tabular}{@{}  l
    S[round-mode=places, round-precision=2]
    S[round-mode=places, round-precision=2]
    p{0.5em}  % spacer column separating the two covariance groups
    S[round-mode=places, round-precision=2]
    S[round-mode=places, round-precision=2]
    @{}}
    \toprule
    & \multicolumn{5}{c}{\makecell{ELBO difference \\
    (alternative test) $-$ (t-test)}}\\
    \cmidrule{2-6}
    & \multicolumn{2}{c}{\makecell{Diagonal Covariance}}  &  & \multicolumn{2}{c}{\makecell{Dense Covariance}} \\
    \cmidrule(lr){2-3} \cmidrule(rl){5-6}
    & \multicolumn{1}{c}{CvM} & \multicolumn{1}{c}{KS} & & \multicolumn{1}{c}{CvM} & \multicolumn{1}{c}{KS} \\
    \midrule
    \textbf{Bayesian log. regr.}\\
    \hspace{1em}a1a               & 0.048218 & 0.016602 && 0.000000 & 0.000000 \\
    \hspace{1em}australian        & -0.011169 & -0.457397 && -0.000153 & -0.000580 \\
    \hspace{1em}ionosphere        & -0.305847 & -0.435760 && -0.002045 & -0.001984 \\
    \hspace{1em}mushrooms         & 0.067688 & -0.548920 && 0.004501 & -0.002533 \\
    \hspace{1em}madelon           & -0.029297 & -0.131592 && 0.000000 & 0.000000 \\
    \hspace{1em}sonar             & -1.538651 & -1.683945 && 0.000000 & -0.000946 \\
    \textbf{Stan models}\\
    \hspace{1em}congress          & -0.045685 & -0.051361 && -0.006104 & -0.064087 \\
    \hspace{1em}election88        & 0.036255 & -0.102417 && 0.282471 & -0.155518 \\
    \hspace{1em}election88Exp     & -1.126709 & -0.891846 && -1.851196 & -6.350830 \\
    \hspace{1em}electric          & 0.005127 & 0.000000 && 0.000122 & 0.000122 \\
    \hspace{1em}electric-one-pred & 0.000000 & 0.000000 && 0.000000 & 0.000000 \\
    \hspace{1em}hepatitis         & -0.005920 & -0.003418 && -0.001221 & -0.001465 \\
    \hspace{1em}hiv-chr           & 0.014954 & -0.066895 && 0.011475 & 0.008789 \\
    \hspace{1em}irt               & -0.003906 & -0.003906 && -0.001953 & 0.000977 \\
    \hspace{1em}mesquite          & -0.035463 & -0.050467 && -0.014023 & -0.040293 \\
    \hspace{1em}radon             & 0.009766 & 0.005859 && -0.001953 & -0.003052 \\
    \hspace{1em}wells             & 0.000000 & -0.004761 && -0.004272 & -0.006348 \\
    \bottomrule
\end{tabular}  
\end{center}
\caption{Comparison of the performance of the two-sample Kolmogorov--Smirnov test (KS) and the two-sample Cram\'er--von Mises test (CvM) for the detection of convergence, as alternatives to the t-test, in the experiments of Section~\ref{sec:experiments}.
The table shows the difference in ELBO when using an alternative test (CvM or KS) instead of the t-test (negative values indicate that a better approximating distribution was found using the t-test).
For most models the differences are small, so the choice of statistical test has little effect on the result.
However, the CvM test appears to be a slightly better replacement for the t-test than the KS test, primarily due to the greater statistical power of the CvM test \citep{stephens1974edf}.
This can be observed in the slightly different behavior on the \texttt{mushrooms}, \texttt{sonar}, and \texttt{election88Exp} datasets.
\label{tab:ablation-statistical-test}
}
\end{table}
