\section{Introduction}
\label{sec:introduction}

\begin{figure}[h]
    \includegraphics[width=\columnwidth]{plots/fixed/ADAMDoGCoord_reverse_normalized_svg-tex.pdf}
    \includegraphics[width=\columnwidth]{plots/legend-cropped.pdf}
    \caption{Relative Coreset MCMC posterior approximation error comparing ADAM (with different learning rates) and 
    the proposed Hot DoG method (under our recommended setting).
    The metric plotted is
    the ratio of average squared z-scores (defined in \cref{eq:avg_sq_z}) under ADAM to those under Hot DoG.
    Values above the horizontal black line ($10^0$) indicate that the proposed Hot DoG method outperformed ADAM.
    The comparison uses median values across 10 trials after 200,000 optimization iterations,
    for a variety of datasets, models, and coreset sizes.
    }
    \label{fig:NVDoG_ADAM}
\end{figure}

Bayesian inference provides a flexible framework for parameter estimation and uncertainty quantification in 
statistical models. Markov chain Monte Carlo (MCMC) [\citealp{robert1999monte}; \citealp{robert2011short}; 
\citealp[Chs.~11 and 12]{gelman2013bayesian}], the standard methodology for 
performing Bayesian inference, involves simulating carefully constructed Markov chains whose stationary distribution 
is the target Bayesian posterior. In the large-scale data setting, this procedure can become prohibitively expensive, 
as it requires iterating over the entire data set to simulate the next state. 

\emph{Bayesian coresets} \citep{huggins2016coresets}
are a popular approach for speeding up Bayesian inference in the 
large-scale data setting.
A Bayesian coreset
is a weighted subset of data that replaces the full data set 
during inference, leveraging the insight that large datasets often exhibit a 
significant degree of redundancy.\footnote{A related approach, \emph{data distillation}, constructs a small 
synthetic data set for downstream tasks. However, this approach often requires bespoke methods for non-real-valued data
\citep[see][Sec.~3]{sachdeva2023data}. In contrast, Bayesian coresets do not modify individual data points, 
and so are fully generic.} 
With a carefully constructed coreset, one can significantly reduce the computational cost 
of inference while still obtaining samples from a high-quality 
approximation of the full Bayesian posterior. In fact, given a data set of $N$ points, a 
coreset of size $\scO\left(\log N\right)$ is sufficient for providing a near-exact posterior approximation 
in exponential family and other sufficiently simple models [\citealp[Thms.~4.1 and 4.2]{naik2022fast}; \citealp[Prop.~3.1]{chen2022bayesian}]
and a coreset of size $\scO\left(\operatorname{polylog} N\right)$ is sufficient in more general cases \citep[Cor.~6.1]{campbell2024general}.
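
As a sketch of this idea, in illustrative notation that is assumed here rather than taken from the remainder of the paper, 
let $\pi_0$ denote the prior density and $\ell_n(\theta)$ the log-likelihood of the $n$-th data point. 
The full posterior $\pi$ and a coreset posterior $\pi_w$ built from $M \ll N$ points with indices $n_1, \dots, n_M$ 
and weights $w_m \geq 0$ then take the form
\[
    \pi(\theta) \propto \pi_0(\theta) \exp\left(\sum_{n=1}^{N} \ell_n(\theta)\right),
    \qquad
    \pi_w(\theta) \propto \pi_0(\theta) \exp\left(\sum_{m=1}^{M} w_m \ell_{n_m}(\theta)\right),
\]
so that evaluating the coreset log-density requires $M$ rather than $N$ likelihood evaluations.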

Constructing a coreset involves picking the data points to include in the coreset and assigning each data point its 
corresponding weight. The state-of-the-art method, Coreset MCMC \citep{chen2024coreset}, selects coreset 
points by sampling them uniformly from the full data set, and learns the weights using stochastic gradient optimization techniques, e.g., ADAM \citep{kingma2014adam}, 
where the gradients are estimated using MCMC draws targeting the current coreset posterior. 
However, as we demonstrate in this paper, there are two issues with this approach.
First, the quality of the constructed coreset is sensitive to the learning rate of the 
stochastic optimization algorithm. Second, gradient estimates based on MCMC draws
are strongly affected by initialization bias in early iterations, leading to poor 
optimization performance.

To address these challenges, we first propose 
\emph{Hot-start Distance over Gradient} (Hot DoG), a tuning-free stochastic
gradient optimization procedure that can be used for learning coreset weights
in Coreset MCMC. Hot DoG is a stochastic gradient method combining techniques from Do(W)G
\citep{ivgi2023dog,khaled2023dowg}, ADAM \citep{kingma2014adam}, and RMSProp
\citep{hinton2012neural} to set learning rates automatically. Hot DoG also includes an
automated warm-up phase prior to weight optimization, which guards against the use
of low-quality MCMC draws when estimating the objective function gradients.
We then analyze the convergence behaviour of Hot DoG in a representative setting.
Empirically, \cref{fig:NVDoG_ADAM} demonstrates that Hot DoG under our recommended setting
is competitive with optimally tuned ADAM across a wide range of models, datasets, and coreset sizes, 
and can be multiple orders of magnitude more accurate than ADAM using other learning rates.
Beyond the results shown in \cref{fig:NVDoG_ADAM}, we provide an extensive 
empirical investigation of the reliability of Hot DoG in comparison to other methods across 
various synthetic and real-data experiments. 
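
For context, the base DoG rule of \citet{ivgi2023dog}, one of the ingredients named above, sets the step size 
at iteration $t$ from the iterates $x_0, \dots, x_t$ and stochastic gradients $g_0, \dots, g_t$ observed so far. 
In illustrative notation assumed here, it can be sketched as
\[
    \eta_t = \frac{\bar{r}_t}{\sqrt{\sum_{i=0}^{t} \|g_i\|^2}},
    \qquad
    \bar{r}_t = \max_{i \leq t} \|x_i - x_0\|,
\]
where a small positive initial distance is substituted for $\bar{r}_0$ so that the first step is nonzero. 
This is the original DoG rule, shown only to indicate how a learning rate can be set without tuning; 
it is not the Hot DoG procedure proposed in this paper.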

