\subsection{Stein Variational Gradient Descent}
Stein Variational Gradient Descent \citep[SVGD]{liu2016stein} is a non-parametric inference algorithm that approximates a target distribution with a set of $\varrho \in \NNN^+$ particles $X = \{\vx_i\}^\varrho_{i=1}$ as $q(\vx) = \sum_{\vx_i \in X} \delta (\vx - \vx_i) / \varrho$, where we use $\delta (\cdot)$ to denote the Dirac delta function.
Given an initial set of particles, the goal is to determine an optimal particle transformation $\phi^*: \RRR^d \to \RRR^d$ that maximally decreases the KL divergence $D_{\text{KL}} (q~ \Vert~ p)$:
\begin{align}
    \begin{split}
        &\vx_i \gets \vx_i + \epsilon \phi^*(\vx_i), \quad \forall \vx_i \in X \label{eq:svgd1}\\
        \text{s.t.\ } &\phi^* = \argmin_{\phi \in \FF} \left\{ \left. \frac{d}{d\epsilon} D_{\text{KL}}(q_{[\epsilon \phi]}~ \Vert~ p) \right|_{\epsilon = 0} \right\},
    \end{split}
\end{align}
where $\epsilon \in \RRR$ is a sufficiently small step-size, $q_{[\epsilon \phi]}$ denotes the distribution of the updated particles, and $\FF$ is a set of candidate transformations.

The main result by \citet{liu2016stein} is the derivation of a closed-form solution to this optimization problem.
By choosing $\FF$ as the unit ball $\BB_k$ in a vector-valued reproducing kernel Hilbert space $\HH_k^d$, i.e., $\FF_k = \BB_k = \{\phi \in \HH_k^d: \lVert \phi \rVert_{\HH_k^d} \leq 1\}$, with kernel function $k(\cdot, \cdot): \RRR^d \times \RRR^d \to \RRR$, the authors show that the solution to Eq.~\eqref{eq:svgd1} is:
\begin{align}
    \phi^*_k (\cdot) \propto \EEE_{\vx \sim q} \bigl[ \underbrace{\nabla_\vx \log p(\vx)k(\vx, \cdot)}_{\text{driving force}} + \underbrace{\nabla_\vx k(\vx, \cdot)}_{\text{repulsive force}} \bigr].\label{eq:svgd2}
\end{align}
This result can be used to update the particle set iteratively using Eqs.~\eqref{eq:svgd1} and \eqref{eq:svgd2}, where the expectation is estimated via a Monte Carlo (MC) approximation over the entire particle set $X$.
Intuitively, the particle update balances likelihood maximization and particle repulsion:
the first term drives particles toward regions of higher probability, while the second term counteracts this by repelling particles from one another based on the kernel gradient \citep{d2021annealed, ba2021understanding}.
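As a concrete illustration, the MC estimate of Eq.~\eqref{eq:svgd2} evaluated at a particle $\vx_i$ can be written out as follows.
The RBF kernel and its bandwidth $h > 0$ are assumptions made only for this example; the text above does not prescribe a particular kernel.
\begin{align*}
    \hat{\phi}^*_k (\vx_i) &= \frac{1}{\varrho} \sum_{j=1}^{\varrho} \bigl[ k(\vx_j, \vx_i) \nabla_{\vx_j} \log p(\vx_j) + \nabla_{\vx_j} k(\vx_j, \vx_i) \bigr],\\
    k(\vx_j, \vx_i) &= \exp\bigl( - \lVert \vx_j - \vx_i \rVert^2 / (2 h^2) \bigr),\\
    \nabla_{\vx_j} k(\vx_j, \vx_i) &= \frac{\vx_i - \vx_j}{h^2}~ k(\vx_j, \vx_i).
\end{align*}
Under this kernel choice, the repulsive contribution of particle $\vx_j$ points from $\vx_j$ toward $\vx_i$, so it moves $\vx_i$ away from $\vx_j$, and it vanishes for particles that are far apart relative to $h$.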

Because vanilla SVGD is sensitive to the initialization of particles and prone to mode collapse \citep{zhuo2018message, ba2021understanding, zhang2020stochastic}, prior work proposed \textit{Annealed SVGD} \citep{liu2017stein, d2021annealed}.
This extension of SVGD reweights the terms in the update based on the optimization progress \citep{d2021annealed}.
Given a timestep-dependent temperature parameter $\gamma(t) \in \RRR$, the annealed update is:
\begin{align}
    \phi^*_k (\cdot) \propto \EEE_{\vx \sim q} \left[ \nabla_\vx \log p(\vx)k(\vx, \cdot) + \gamma(t) \nabla_\vx k(\vx, \cdot) \right].\label{eq:asvgd}
\end{align}
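To make the role of the temperature explicit, two special cases follow directly from Eq.~\eqref{eq:asvgd}.
The values of $\gamma(t)$ below are chosen for illustration only and do not correspond to a specific schedule from the cited works.
\begin{align*}
    \gamma(t) = 1: &\quad \phi^*_k (\cdot) \propto \EEE_{\vx \sim q} \left[ \nabla_\vx \log p(\vx)k(\vx, \cdot) + \nabla_\vx k(\vx, \cdot) \right],\\
    \gamma(t) = 0: &\quad \phi^*_k (\cdot) \propto \EEE_{\vx \sim q} \left[ \nabla_\vx \log p(\vx)k(\vx, \cdot) \right].
\end{align*}
The first case recovers the vanilla update in Eq.~\eqref{eq:svgd2}, while the second retains only the driving force.
Values $\gamma(t) > 1$ weight the repulsive term more strongly than in vanilla SVGD.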

\subsection{Covariance Matrix Adaptation Evolution Strategy}
The Covariance Matrix Adaptation Evolution Strategy \citep[CMA-ES]{hansen2001} is one of the most popular ES algorithms. 
We therefore choose it as the starting point for our ES-based SVGD method.
The core idea of the CMA-ES algorithm is to iteratively optimize the parameters of a Gaussian search distribution $\NN (\vx, \sigma^2 \vC)$ from which the candidate solutions are sampled.
While the algorithmic intuition of CMA-ES is similar to MC gradient approaches \citep{salimans2017evolution}, CMA-ES updates the search distribution following natural gradient steps, which have been shown to be more efficient than standard gradient descent steps on multiple problems \citep{martens2020new, akimoto2012theoretical, glasmachers2010exponential}.

We note that our notation in the following deviates from the default notation in the ES literature, as some of its variable names carry a different meaning in the variational inference (VI) literature.
For instance, $\mu$ commonly denotes the mean of a Gaussian in the VI literature, while it refers to the number of elites in CMA-ES.
To improve clarity, we thus use the variable $n$ to denote the size of a sampled CMA-ES population and $m$ for the number of selected elites.
In our notation, CMA-ES is therefore an $(m, n)$ strategy.
The CMA-ES algorithm relies on multiple hyperparameters, which we fix to the default values from \citet{hansen2016cma}.
For completeness, we include the definitions of these variables -- $w_i, \alpha_1, \alpha_m, \alpha_\sigma, h_\sigma, d(h_\sigma)$ and $\bar{w}_i$ -- in the \hyperref[secAppendix]{Appendix}.
Further, we slightly overload notation by using $\vp$ to denote the evolution paths following \citet{hansen2016cma} (unlike probability density functions, which we denote by $p$).

Given a population of $n$ candidate samples $\vxi_i \sim \NN (\vx, \sigma^2 \vC)$, each iteration of CMA-ES updates the search parameters as follows.
First, the samples $\vxi_i$ are evaluated and ranked by their fitness $f(\vxi_i)$ in ascending order.
To simplify notation, we assume ranked solutions in the following, i.e., $i < j \Rightarrow f(\vxi_i) \leq f(\vxi_j)$.
This allows us to assign each sample a mutation weight $w_i$, where better solutions receive higher weights.
For details on the exact computation of the weights, we refer to \citet{hansen2016cma} and to Appendix~\ref{secAlgSuppl} of our work.
The mean of the search distribution is then updated by mutating the $m \leq n$ best samples from the current generation of candidates:
\begin{align}
    &\vx \gets \vx + \cma \label{eq:cma-mu}\\ 
    \intertext{where }
    &\cma = \sigma \tsum_{i=1}^m w_i \vy_i, \text{ and } \vy_i = (\vxi_i - \vx) / \sigma. \label{eq:cma-step}
\end{align}
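Substituting Eq.~\eqref{eq:cma-step} into Eq.~\eqref{eq:cma-mu} shows what the mean update computes:
\begin{align*}
    \vx + \cma = \vx + \sigma \sum_{i=1}^m w_i \frac{\vxi_i - \vx}{\sigma} = \Bigl(1 - \sum_{i=1}^m w_i\Bigr) \vx + \sum_{i=1}^m w_i \vxi_i.
\end{align*}
If the weights of the $m$ selected samples sum to one (an assumption made here for illustration; the exact weights are defined in the cited reference and in the Appendix), the first term vanishes and the new mean is simply the weighted average of the $m$ best samples, independent of $\sigma$.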
Next, the parameters of the search distribution are updated.
First, the step-size $\sigma$ is updated based on the history of prior steps.
Given
\begin{align}
    &m_{\text{eff}} = (\tsum_{i=1}^m w_i^2)^{-1}, \text{ and }\label{eq:mueff}\\
    &\vp_{\sigma} \leftarrow (1 - \alpha_{\sigma}) \vp_{\sigma} + \sqrt{\alpha_\sigma (2 - \alpha_\sigma)~ m_{\text{eff}}}~\vC^{-\frac{1}{2}} \cma/\sigma \label{eq:cma_evo_trace},\\
\intertext{we define}
    &\sigma \leftarrow \sigma \times \exp\Bigl(\frac{\alpha_{\sigma}}{d_{\sigma}} \Bigl(\frac{\lVert \vp_{\sigma}\rVert}{\EEE \lVert \NN (0, \Id) \rVert} - 1\Bigr)\Bigr),\label{eq:cma-sig}
\end{align}
where $\vp_{\sigma}$ is a moving average over the optimization steps, each of which is whitened using $\vC^{-1/2}/\sigma$ so that, under random selection, the resulting vector follows a standard normal distribution.
Thus, Eq.~\eqref{eq:cma-sig} automatically adapts the step-size based on the expected length of steps, similar to momentum-based optimizers \citep{kingma2014adam, nesterov1983method}, with the hyperparameters $\alpha_\sigma$ and $d_\sigma$ governing the rate of the step-size changes.
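A few special cases may help to read Eqs.~\eqref{eq:mueff} and \eqref{eq:cma-sig}.
The equal weights $w_i = 1/m$ are an assumption made only for this example, and we assume $\alpha_\sigma / d_\sigma > 0$:
\begin{align*}
    w_i = 1/m \ \ \forall i &\ \Rightarrow\ m_{\text{eff}} = \bigl(m \cdot m^{-2}\bigr)^{-1} = m,\\
    \lVert \vp_{\sigma}\rVert = \EEE \lVert \NN (0, \Id) \rVert &\ \Rightarrow\ \sigma \leftarrow \sigma \exp(0) = \sigma,\\
    \lVert \vp_{\sigma}\rVert > \EEE \lVert \NN (0, \Id) \rVert &\ \Rightarrow\ \sigma \text{ increases},\\
    \lVert \vp_{\sigma}\rVert < \EEE \lVert \NN (0, \Id) \rVert &\ \Rightarrow\ \sigma \text{ decreases}.
\end{align*}
In practice, the expected norm is commonly approximated as $\EEE \lVert \NN (0, \Id) \rVert \approx \sqrt{d}\, \bigl(1 - \frac{1}{4d} + \frac{1}{21 d^2}\bigr)$ for search dimension $d$.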
%
Finally, the covariance $\vC$ is updated based on the covariance of the previous steps and current population fitness values:
\begin{align}
    \begin{split}
       &\vC \leftarrow (1 + \alpha_1 d(h_\sigma)- \alpha_1 - \alpha_m \tsum_{j=1}^n w_j)\vC\\
       &\qquad + \alpha_1 \vp_c {\vp_c}^T + \alpha_m \tsum_{i=1}^n \bar{w}_i \vy_i \vy_i^T,
    \end{split}\label{eq:cma-cov}\\
\intertext{with } 
    &\vp_c \leftarrow (1 - \alpha_c) \vp_c + h_{\sigma} \sqrt{\alpha_c (2-\alpha_c) m_{\text{eff}}}~ \cma/\sigma.\label{eq:cma-pc}
\end{align}
In words, the covariance update performs smoothing over the optimization path to update $\vC$ based on the within- and between-step covariance of well-performing solutions.
Thus, Eq.~\eqref{eq:cma-cov} scales $\vC$ along directions of successful steps to make the search converge faster.
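To see this scaling more concretely, consider the contribution of the two additive terms in Eq.~\eqref{eq:cma-cov} to the quadratic form $\mathbf{u}^T \vC \mathbf{u}$, which is proportional to the variance of the search distribution along a unit direction $\mathbf{u}$.
The vector $\mathbf{u}$ is introduced here only for illustration, and we assume $\alpha_1 \geq 0$:
\begin{align*}
    \mathbf{u}^T \bigl( \alpha_1 \vp_c {\vp_c}^T \bigr) \mathbf{u} &= \alpha_1 \bigl( {\vp_c}^T \mathbf{u} \bigr)^2 \geq 0,\\
    \mathbf{u}^T \Bigl( \alpha_m \sum_{i=1}^n \bar{w}_i \vy_i \vy_i^T \Bigr) \mathbf{u} &= \alpha_m \sum_{i=1}^n \bar{w}_i \bigl( \vy_i^T \mathbf{u} \bigr)^2.
\end{align*}
The first term is largest when $\mathbf{u}$ is aligned with the evolution path $\vp_c$ and zero when $\mathbf{u}$ is orthogonal to it, so it increases the variance along the direction of accumulated steps.
The sign of each summand in the second term is determined by $\alpha_m \bar{w}_i$.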
