

\section{Experiments}
\label{sec:experiments}

\begin{figure*}
    \begin{subfigure}{0.33\textwidth}
    \includegraphics[width=\columnwidth]{plots/fixed/gaussian_location_coresetMCMC_metrics_combined_1000_ADAMDoGCoord.png}
    \caption{Gaussian location}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
    \includegraphics[width=\columnwidth]{plots/fixed/sparse_regression_coresetMCMC_metrics_combined_1000_ADAMDoGCoord.png}
    \caption{Sparse regression}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
    \includegraphics[width=\columnwidth]{plots/fixed/linear_regression_coresetMCMC_metrics_combined_1000_ADAMDoGCoord.png}
    \caption{Linear regression}\label{fig:burnincomparison-linear}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
    \includegraphics[width=\columnwidth]{plots/fixed/logistic_regression_coresetMCMC_metrics_combined_1000_ADAMDoGCoord.png}
    \caption{Logistic regression}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
    \includegraphics[width=\columnwidth]{plots/fixed/poisson_regression_coresetMCMC_metrics_combined_1000_ADAMDoGCoord.png}
    \caption{Poisson regression}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
    \includegraphics[width=\columnwidth]{plots/fixed/bradley_terry_coresetMCMC_metrics_combined_1000_ADAMDoGCoord.png}
    \caption{Bradley-Terry}
    \end{subfigure}
    \caption{Traces of average squared coordinate-wise z-scores between the true and approximated posterior 
    across all experiments, obtained using Hot DoG with and without the hot-start test. 
    All figures share the legend in \cref{fig:burnincomparison-linear}. The coreset size is $M=1000$, and each line
    represents a different initial learning rate parameter. The lines indicate the median over $10$ runs.
    Orange lines indicate runs with the hot-start test, and blue lines indicate runs without it.}
    \label{fig:burnincomparison}
\end{figure*}

\begin{figure*}
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\columnwidth]{plots/gaussian_location_burnin.png}
        \caption{Gaussian location}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\columnwidth]{plots/sparse_reg_burnin.png}
        \caption{Sparse regression}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\columnwidth]{plots/lin_reg_burnin.png}
        \caption{Linear regression}\label{fig:burnintest-linear}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\columnwidth]{plots/log_reg_burnin.png}
        \caption{Logistic regression}\label{fig:burnintest-logistic}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\columnwidth]{plots/poi_reg_burnin.png}
        \caption{Poisson regression}\label{fig:burnintest-poiss}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\columnwidth]{plots/bradley_terry_burnin.png}
        \caption{Bradley-Terry}
    \end{subfigure}
    \caption{Trace of gradient estimate norms (blue) and hot-start test statistics (green) before weight optimization
            across all experiments with $M=1000$.
            The orange horizontal line is the test statistic threshold $c=0.5$.}
    \label{fig:burnintest}
\end{figure*}

\begin{figure*}
    \begin{subfigure}{0.33\textwidth}
    \includegraphics[width=\columnwidth]{plots/fixed/gaussian_location_coresetMCMC_metrics_mix_1000_ADAMDoGCoord.png}
    \caption{Gaussian location}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
    \includegraphics[width=\columnwidth]{plots/fixed/sparse_regression_coresetMCMC_metrics_mix_1000_ADAMDoGCoord.png}
    \caption{Sparse regression}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
    \includegraphics[width=\columnwidth]{plots/fixed/linear_regression_coresetMCMC_metrics_mix_1000_ADAMDoGCoord.png}
    \caption{Linear regression}\label{fig:tracecombined-linear}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
    \includegraphics[width=\columnwidth]{plots/fixed/logistic_regression_coresetMCMC_metrics_mix_1000_ADAMDoGCoord.png}
    \caption{Logistic regression}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
    \includegraphics[width=\columnwidth]{plots/fixed/poisson_regression_coresetMCMC_metrics_mix_1000_ADAMDoGCoord.png}
    \caption{Poisson regression}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
    \includegraphics[width=\columnwidth]{plots/fixed/bradley_terry_coresetMCMC_metrics_mix_1000_ADAMDoGCoord.png}
    \caption{Bradley-Terry}
    \end{subfigure}
    \caption{Traces of average squared coordinate-wise z-scores between the true and approximated posterior 
    across all experiments, obtained from Hot DoG and optimally-tuned ADAM. 
    All figures share the legend in \cref{fig:tracecombined-linear}.
    The coreset size is $M=1000$, and each line represents a different initial learning rate parameter. 
    The lines indicate the median from $10$ runs.}
    \label{fig:tracecombined}
\end{figure*}

\begin{figure*}
    \begin{subfigure}{0.24\textwidth}
        \includegraphics[width=\columnwidth]{plots/ADAMDoGCoord_normalized_DoG_reverse.png}
        \includegraphics[width=\columnwidth]{plots/legend.png}
        \caption{DoG}
    \end{subfigure}
    \begin{subfigure}{0.24\textwidth}
        \includegraphics[width=\columnwidth]{plots/ADAMDoGCoord_normalized_DoWG_reverse.png}
        \includegraphics[width=\columnwidth]{plots/legend.png}
        \caption{DoWG}
    \end{subfigure}
    \begin{subfigure}{0.24\textwidth}
        \includegraphics[width=\columnwidth]{plots/fixed/ADAMDoGCoord_normalized_adam_reverse.png}
        \includegraphics[width=\columnwidth]{plots/legend.png}
        \caption{ADAM}
    \end{subfigure}
    \begin{subfigure}{0.24\textwidth}
        \includegraphics[width=\columnwidth]{plots/ADAMDoGCoord_normalized_padam_reverse.png}
        \includegraphics[width=\columnwidth]{plots/legend.png}
        \caption{prodigy ADAM}
    \end{subfigure}
    \begin{subfigure}{0.24\textwidth}
        \includegraphics[width=\columnwidth]{plots/ADAMDoGCoord_normalized_DoG_reverse_mix.png}
        \includegraphics[width=\columnwidth]{plots/legend.png}
        \caption{DoG with hot-start}
    \end{subfigure}
    \begin{subfigure}{0.24\textwidth}
        \includegraphics[width=\columnwidth]{plots/ADAMDoGCoord_normalized_DoWG_reverse_mix.png}
        \includegraphics[width=\columnwidth]{plots/legend.png}
        \caption{DoWG with hot-start}
    \end{subfigure}
    \begin{subfigure}{0.24\textwidth}
        \includegraphics[width=\columnwidth]{plots/fixed/ADAMDoGCoord_normalized_adam_reverse_mix.png}
        \includegraphics[width=\columnwidth]{plots/legend.png}
        \caption{ADAM with hot-start}
    \end{subfigure}
    \begin{subfigure}{0.24\textwidth}
        \includegraphics[width=\columnwidth]{plots/ADAMDoGCoord_normalized_padam_reverse_mix.png}
        \includegraphics[width=\columnwidth]{plots/legend.png}
        \caption{prodigy ADAM with hot-start}
    \end{subfigure}
    \caption{Relative Coreset MCMC posterior approximation error comparing different optimization algorithms 
    (labeled in the subfigure captions) and the proposed Hot DoG method (with fixed $r=0.001$ and $c=0.5$).
    The metric plotted is
    the ratio of average squared z-scores (defined in \cref{eq:avg_sq_z}) under the algorithm labeled in each 
    subfigure caption to those under Hot DoG.
    Values above the horizontal black line ($10^0$) indicate that 
    the proposed Hot DoG method outperformed the method it is compared against.
    Median values after $200{,}000$ optimization iterations across $10$ trials are used for the relative comparison 
    for a variety of datasets, models, and coreset sizes.}
    \label{fig:comparison_relative}
\end{figure*}

In this section, we demonstrate the effectiveness of Hot DoG by comparing it against 
optimally-tuned ADAM, with the learning rate selected by a log-scale grid search, and against other learning-rate-free 
stochastic gradient methods, namely prodigy ADAM \citep{mishchenko2023prodigy}, DoG \citep{ivgi2023dog}, 
and DoWG \citep{khaled2023dowg}, over different initial parameters. 
We compare the quality of posterior approximations over different coreset sizes $M$ and weight optimization procedures.
Following \citet{chen2024coreset},
we set the number of Markov chains to $K=2$
and subsample size to $S=M$ in \cref{eq:gradest}.
We set
$\kappa_w$ to the hit-and-run slice sampler with doubling
\citep{belisle1993hit,neal2003slice} for all real data experiments.
For the Gaussian location model, we use a kernel that directly samples from $\pi_w$ [\citealp[Sec.~3.4]{chen2024coreset}];
for the sparse regression example, we use Gibbs sampling \citep{george1993variable}.

We compare these algorithms using six different Bayesian 
models, the details of which are in \cref{sec:appendix_4}.
We use Stan \citep{carpenter2017stan} to obtain full data inference results for real data experiments, 
and Gibbs sampling \citep{george1993variable} for the sparse regression model with discrete variables. 
For all experiments, we measure the posterior approximation quality using the 
average squared z-score, which we define as 
\begin{equation}
    \frac{1}{D}\sum_{i=1}^D \left( \frac{\mu_i - \hat{\mu}_i}{\sigma_i} \right)^2. \label{eq:avg_sq_z}
\end{equation}
In the above definition, $D$ denotes the dimension of $\Theta$; 
$\mu_i$ and $\sigma_i$ are, respectively, the coordinate-wise mean and standard deviation estimated using the full data posterior, 
and $\hat{\mu}_i$ is the coordinate-wise mean estimated using draws from Coreset MCMC.
This estimate is computed in a streaming fashion using the second half of all draws 
collected up to the current iteration; note that this includes draws from $\pi_{w_0}$ before the hot-start test passes.
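As a purely illustrative example with hypothetical values (not taken from any experiment), 
suppose $D=2$, $(\mu_1,\mu_2)=(0,1)$, $(\sigma_1,\sigma_2)=(1,2)$, and $(\hat{\mu}_1,\hat{\mu}_2)=(0.5,0)$. 
The average squared z-score is then
\[
    \frac{1}{2}\left[ \left( \frac{0 - 0.5}{1} \right)^2 + \left( \frac{1 - 0}{2} \right)^2 \right] 
    = \frac{1}{2}\left( 0.25 + 0.25 \right) = 0.25,
\]
which corresponds to a root-mean-square error of half a posterior standard deviation per coordinate.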

Each algorithm was run on 8 single-threaded cores of a 2.1GHz Intel Xeon Gold 6130 processor with 32GB memory. 
Code for these experiments is available at \url{https://github.com/NaitongChen/automated-coreset-mcmc-experiments}.
More experimental details and additional plots are in \cref{sec:appendix_4,sec:appendix_5}.

\textbf{Effect of hot-start test.}
\cref{fig:burnincomparison} compares Hot DoG with and without the hot-start test for $M=1000$ across all experiments;
the same plots for other coreset sizes can be found in \cref{sec:appendix_5}. 
Without the hot-start test, the traces often hit a long plateau that lasts until 
exponentially-weighted averaging has decayed the influence of early large gradient norms. 
In contrast, with the hot-start test, we begin by simulating from Markov chains 
targeting $\pi_{w_0}$, and start optimizing the coreset weights only after the test has passed. 
In terms of the number of log potential evaluations, Hot DoG with 
the hot-start test leaves the plateau sooner than without it.

\cref{fig:burnintest} examines the behaviour of the hot-start test in more detail, showing the traces 
of the gradient estimate norms $\|\hat{g}_t\|$ and test statistics \texttt{median}$(u_1,\dots,u_K)$ across optimization 
iterations when using Hot DoG. 
Here we only show plots for $M=1000$; the same plots for other 
coreset sizes can be found in \cref{sec:appendix_5}. 
In some experiments, the Markov chains are initialized well enough that 
the gradient norms are already stable, and the test passes almost immediately.
In others, the Markov chains are initialized poorly 
and the gradient norms are initially large; nevertheless, the hot-start test passes
shortly after they stabilize. Across all experiments,
a test statistic threshold of $c=0.5$ worked well.

\textbf{Robustness to fixed parameter $r$.}
\Cref{fig:tracecombined} provides an examination of the
robustness of the proposed method to the fixed initial learning rate parameter $r$. 
Across all experiments, different values of $r$ spanning multiple orders of magnitude 
result in similar posterior approximations across optimization iterations. Note that $M=1000$ for all plots in 
\cref{fig:tracecombined}. The same trends can be observed over different coreset sizes (see \cref{sec:appendix_5}).
In practice, we follow the recommendation of \citet{ivgi2023dog} and set $r=0.001$. 

\textbf{Comparison with other related methods.}
\Cref{fig:comparison_relative} shows a comparison between our method and DoG, DoWG, ADAM, as well as prodigy ADAM.
We fix $r=0.001$ and $c=0.5$ for Hot DoG. 
Since the hot-start test itself can be applied to all methods, we compare Hot DoG against 
the other methods both with and without the hot-start test. The posterior approximation quality of 
Hot DoG is orders of magnitude better than that of all other methods in many of the settings tested, and remains competitive otherwise. 
In particular, Hot DoG is capable of matching the performance of optimally-tuned ADAM without tuning.
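To make the scale of \cref{fig:comparison_relative} concrete, write $Z_{\mathrm{alg}}$ and $Z_{\mathrm{HD}}$ 
(notation introduced here only for illustration) for the median average squared z-scores of a competing method 
and of Hot DoG, respectively. The plotted quantity is then the ratio
\[
    \frac{Z_{\mathrm{alg}}}{Z_{\mathrm{HD}}},
\]
so a hypothetical value of $10^2$ would mean that Hot DoG attains an average squared z-score $100$ times smaller 
than the competing method, or equivalently a root-mean-square coordinate-wise z-score $10$ times smaller.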