This section introduces a novel framework that uses a multi-population ES to efficiently discover multiple high-quality solutions to an optimization problem.
The idea of this work is to represent each SVGD particle by the mean of an ES search distribution and use the estimated steps of the ES algorithm as the \textit{driving force} in the SVGD particle update. 
Hence, our approach exploits the CMA-ES step-size adaptation mechanism to make gradient-free inference more efficient. 
Intuitively, the reformulated update permits larger particle updates, similar to momentum, especially in flat regions of the target.
Since ES are easily parallelizable on modern GPUs \citep{lange2023evosax, tang2022evojax}, this approach comes at a small additional runtime cost.
In the following, we use $\varrho \in \NNN^+$ to refer to the number of ES search distributions, $n\in \NNN^+$ to denote the size of each sampled population, and $m\in \NNN^+$ for the number of elite samples.
This amounts to a total population size of $\varrho n$ for ES-based algorithms.

Based on the SVGD update in Eq.~\eqref{eq:svgd2} and the CMA-ES update of the search distribution mean in Eq.~\eqref{eq:cma-mu}, we now define \textit{Stein Variational CMA-ES (SV-CMA-ES)}.
The full algorithm is listed in Algorithm \ref{alg:sv-cmaes}.
SV-CMA-ES is a multi-population version of CMA-ES, where $\varrho$ search distributions are updated in parallel, each representing an SVGD particle $\vx_i$ via its distribution mean.
In other words, for each particle, there is a corresponding Gaussian search distribution that is centered at the particle and parametrized as $\NN (\vx_i, \sigma_i^2 \vC_i)$.
%
Given the standard CMA-ES distribution update step $\cma$ from Eq.~\eqref{eq:cma-step} and a sampled population $\vxi_{ij} \sim \NN (\vx_i, \sigma_i^2 \vC_i)$, we propose the following SVGD-based update:
\begin{align}
    \vx_i &\gets \vx_i + \epsilon~\phi (\vx_i) \quad \text{with} \label{eq:sv-cma-m}\\
    \phi(\vx_i) &= \EEE_{\vx_j \sim q} \Bigl[ k(\vx_j, \vx_i)\Delta {\vx_j}_{\textsc{cma}} + \N_{\vx_j} k(\vx_j, \vx_i) \Bigr]\nonumber\\
    &= \frac{1}{\varrho} \sum_{j=1}^\varrho \Biggl[ \underbrace{\Bigl[\sum_{\ell=1}^m w_{j\ell}(\vxi_{j\ell} - \vx_j)\Bigr] k(\vx_j, \vx_i)}_{\text{driving force}} + \underbrace{\vphantom{\Bigl[\sum_{k=1}^m\Bigr]} \N_{\vx_j} k(\vx_j, \vx_i)}_{\text{repulsive force}}\Biggr],\label{eq:sv-cma-step}
\end{align}
where we assume the same sorting by fitness in our sum as in vanilla CMA-ES and $w_{j\ell}$ are the sample weights that are computed based on the fitness values $f(\vxi_{j\ell})$ following \citet{hansen2016cma}.
Further, we use an additional step-size hyperparameter $\epsilon$ for notational consistency with SVGD, but we always fix it to $\epsilon = 1$.
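As a sanity check, consider the single-particle case $\varrho = 1$ under the assumption of a kernel that satisfies $k(\vx, \vx) = 1$ and $\N_{\vx} k(\vx, \vy)\big|_{\vx = \vy} = \mathbf{0}$, which holds, for instance, for an RBF kernel (the kernel choice is an assumption made for this illustration only).
In this case, the repulsive force vanishes and Eq.~\eqref{eq:sv-cma-step} simplifies to
\begin{equation*}
    \phi(\vx_1) = \sum_{\ell=1}^m w_{1\ell}(\vxi_{1\ell} - \vx_1),
\end{equation*}
so that, with $\epsilon = 1$, the particle update in Eq.~\eqref{eq:sv-cma-m} reduces to the weighted recombination step of a single CMA-ES search distribution.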

Eq.~\eqref{eq:sv-cma-step} defines how to update each particle search distribution mean.
It remains to define the other SV-CMA-ES parameter updates.
The original CMA-ES step-size update in Eq.~\eqref{eq:cma-sig} is based on the length of the distribution mean update step.
In the particle update in Eq.~\eqref{eq:sv-cma-m}, this quantity corresponds to the effective update step $\phi(\vx_i)$.
Given this particle shift, the smoothed step estimate $\vp_{\sigma_i}$ is computed analogously to the CMA-ES evolution path update in Eq.~\eqref{eq:cma_evo_trace}:
\begin{equation}
    \vp_{\sigma_i} \leftarrow (1 - \alpha_{\sigma}) \vp_{\sigma_i} + \sqrt{\alpha_\sigma (2 - \alpha_\sigma)~ m_{\text{eff}, i}}~\vC_i^{-\frac{1}{2}} \phi(\vx_i) / \sigma_i.
\end{equation}
Using the same construction, we update $\vp_{c_i}$ based on $\phi(\vx_i)$, from which the covariance $\vC_i$ can be computed using Eq.~\eqref{eq:cma-cov}.
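As an illustrative sketch, assume that the covariance path follows the standard CMA-ES form with a learning rate that we denote by $\alpha_c$ here (a symbol introduced for this illustration only) and that the stall indicator used in some CMA-ES variants is omitted.
Under these assumptions, the update would read
\begin{equation*}
    \vp_{c_i} \leftarrow (1 - \alpha_{c}) \vp_{c_i} + \sqrt{\alpha_c (2 - \alpha_c)~ m_{\text{eff}, i}}~\phi(\vx_i) / \sigma_i.
\end{equation*}
In contrast to $\vp_{\sigma_i}$, no whitening by $\vC_i^{-\frac{1}{2}}$ is applied in this form, since the covariance path accumulates steps in the original coordinate system.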

% Practical considerations
\subsection{Practical considerations}
We now discuss some modifications to the algorithm that we found beneficial in practice.
As noted earlier, the update of the particle in Eq.~\eqref{eq:sv-cma-step} smooths the gradient approximation across all particles.
As a result, the magnitude of the effective steps is reduced compared to standard CMA-ES.
Since CMA-ES reduces the step-size $\sigma$ automatically when small steps are taken, this may lead to premature convergence.
An example that illustrates this problem is a bimodal distribution whose modes are far apart, such that $k(\vx, \vy)$ is close to zero for all pairs of particles $\vx, \vy$ that are located at different modes.
In this scenario, the driving force term of the update corresponds to the vanilla CMA-ES update, scaled down by a factor of $1 / \varrho$.
Hence, the proposed steps in this scenario would shrink iteratively.
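To make the shrinkage explicit, consider the limiting case in which every particle is isolated, i.e., we assume $k(\vx_i, \vx_i) = 1$, and $k(\vx_j, \vx_i) \approx 0$ as well as $\N_{\vx_j} k(\vx_j, \vx_i) \approx \mathbf{0}$ for all $j \neq i$, as is the case for an RBF kernel whose bandwidth is small relative to the distance between particles.
Under these assumptions, only the term $j = i$ contributes to Eq.~\eqref{eq:sv-cma-step}, which yields
\begin{equation*}
    \phi(\vx_i) \approx \frac{1}{\varrho} \sum_{\ell=1}^m w_{i\ell}(\vxi_{i\ell} - \vx_i).
\end{equation*}
For instance, with $\varrho = 10$ particles, each effective step has only about one tenth of the length of the corresponding vanilla CMA-ES step.
If several particles share a mode, the kernel values among them are nonzero and the reduction is less severe.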
To address this issue, we propose the following simplified particle update:
\begin{equation}
    \begin{split}
        &\phi(\vx_i) = \frac{1}{\varrho} \sum_{j=1}^\varrho \Biggl[ \Bigl[\sum_{\ell=1}^m w_{i\ell}(\vxi_{i\ell} - \vx_i)\Bigr] + \N_{\vx_j} k(\vx_j, \vx_i)\Biggr].
    \end{split}\label{eq:sv-cma-final}
\end{equation}
This update uses only the particle $\vx_i$ to estimate the first term of the update, i.e., the driving force.
We note that this corresponds to a hybrid kernel SVGD setting \citep{d2021stein, macdonaldhybrid}, which uses two separate kernels to compute the driving force and repulsion terms,
$\phi_{\text{hybrid}}(\vx_i) = \EEE_{\vx \sim q} \left[ \N_\vx f(\vx) k_1(\vx, \vx_i) + \N_\vx k_2(\vx, \vx_i) \right]$, if we choose $k_1(\vx, \vy) = \varrho\,\mathds{1}(\vx = \vy)$, where the factor $\varrho$ cancels the $1 / \varrho$ of the empirical expectation over the $\varrho$ particles.
This indicator kernel can be approximated by an RBF kernel with bandwidth $h \rightarrow 0$.

While the update in Eq.~\eqref{eq:sv-cma-final} does not possess the same capabilities of transporting particles ``along a necklace'' as the vanilla SVGD update (cf.\ Fig.~1 of \citet{liu2016stein}), it has been noted that these SVGD capabilities play a limited role for practical problems in the first place \citep{d2021annealed}.
Instead, prior work proposed the annealed update in Eq.~\eqref{eq:asvgd} to transport the particles to regions of high density \citep{d2021annealed, liu2017stein}.
In practice, we observe that using the annealed version of the above update, i.e.,
\begin{align}
    \phi(\vx_i) &=\frac{1}{\varrho} \sum_{j=1}^\varrho \Biggl[ \Bigl[\sum_{\ell=1}^m w_{i\ell}(\vxi_{i\ell} - \vx_i)\Bigr] + \gamma(t) \N_{\vx_j} k(\vx_j, \vx_i)\Biggr]\nonumber\\
     &= \sum_{\ell=1}^m w_{i\ell}(\vxi_{i\ell} - \vx_i) + \frac{\gamma(t)}{\varrho}\sum_{j=1}^\varrho \N_{\vx_j} k(\vx_j, \vx_i),\label{eq:asv-cma-final}
\end{align}
ensures sufficient mode coverage to efficiently sample from distributions.

The substitution of the score function with the CMA-ES step introduces a bias in comparison to the SVGD update in Eq.~\eqref{eq:asvgd}. That is, the update does not strictly adhere to the canonical SVGD framework and does not inherit its convergence properties.
Still, we find that, in practice, the update strikes a useful trade-off that combines the computational efficiency of CMA-ES with the particle set entropy preservation capabilities of SVGD.
We leave a more in-depth theoretical analysis of the algorithm for future work, and present our empirical findings in the subsequent section.
For an empirical convergence analysis, we refer to Appendix \ref{sec:app_convergence}.

\input{contents/sv-cma-es-algorithm}
