We compare SV-CMA-ES against the two existing approaches for zero-order SVGD from the literature: \textit{GF-SVGD} as the state-of-the-art method for surrogate-based inference, and MC gradient SVGD as the state-of-the-art gradient approximation method. 
We refer to the latter as \textit{SV-OpenAI-ES} throughout the remainder of the paper following the naming convention of the ES community.
Furthermore, we compare against gradient-based SVGD, which we denote as $\N$-SVGD in the following.
All strategies have been implemented based on the evosax library \citep{lange2023evosax}.

To guarantee a fair comparison, we keep the number of function evaluations equal for all methods.
In other words, if the ES-based methods are evaluated for 4 particles, each sampling subpopulations of size 16, we evaluate GF-SVGD and $\N$-SVGD with 64 particles.
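As a worked instance of this budget, let $n$ denote the number of ES particles, $m$ the subpopulation size sampled per particle, and $n'$ the number of particles of a method without subpopulations (these symbols are introduced here for illustration only). Assuming one function evaluation per sampled candidate or particle in each iteration, the budgets are matched when
\begin{equation*}
    n \cdot m = 4 \cdot 16 = 64 = n'.
\end{equation*}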
For each kernel-based method, we use the standard RBF kernel.
For GF-SVGD, we follow the setup of \citet{han2018stein} and use the same kernel function for the SVGD kernel $k$ and the surrogate kernel $k_\rho$, as well as an isotropic Gaussian prior $\NN(0, \sigma^2 \Id)$.
The optimal scale $\sigma^2$ of the prior is found via a hyperparameter grid search. 
For each method, we implement the annealed version of $\N$-SVGD (cf. Eq.\ \eqref{eq:asvgd}) using a logarithmic schedule $\gamma(t) = \max(\log(T / t), 1)$.
Unless specified differently, this choice is followed in all experiments.
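To make the schedule concrete, the following sketch evaluates it under two assumptions that are not fixed by the text above: $\log$ denotes the natural logarithm, and $T = 1000$ is chosen purely for illustration. The maximum switches branches at $t = T/e$:
\begin{equation*}
    \gamma(t) =
    \begin{cases}
        \log(T / t), & t \le T / e, \\
        1, & t > T / e,
    \end{cases}
    \qquad \text{e.g.\ } \gamma(1) \approx 6.91, \quad \gamma(100) \approx 2.30, \quad \gamma(t) = 1 \text{ for } t \ge 368.
\end{equation*}
Under these assumptions, the weight decays from roughly $6.9$ to $1$ within the first $37\%$ of the iterations and remains constant afterwards.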
% Although it is common practice to use the median heuristic to estimate the kernel bandwidth for SVGD \citep{liu2016stein}, we select the optimal bandwidth via hyperparameter tuning because ES assume a stationary target distribution for convergence, which is not given with a changing kernel function.
For all methods that require an internal optimizer, we use Adam \citep{kingma2014adam}.
We carefully search for the best hyperparameters for each algorithm separately, to guarantee a fair comparison.
The full details of our experimental setup can be found in Appendix~\ref{secExpDets}.
Moreover, we refer to Appendix \ref{secSupplRes} for additional results including ablation studies and empirical runtime and convergence analyses.

\subsection{Sampling from Synthetic Densities}\label{secExpSynth}

\input{contents/sample_fig}

\begin{figure*}[!btp]
    \centering
    \subfloat[Gaussian Mixture]{
        \includegraphics[width=.21\linewidth]{imgs/gmm/GMM_mmd_convergence_parallel}
    }\hfill
    \subfloat[Double Banana]{
        \includegraphics[width=.21\linewidth]{imgs/banana/banana_mmd_convergence_parallel}
    }\hfill
    \subfloat[Motion Planning]{
        \includegraphics[width=.225\linewidth]{imgs/ramos/F_RAMOS_mmd_convergence_parallel}
    }\hfill
    \subfloat[MMD \wrt $\N$-SVGD]{
        \includegraphics[width=.21\linewidth]{imgs/approximation/meta_mmd_convergence_parallel}
    }\hfill
    \caption{\textbf{(a)-(c):} MMD \wrt \textit{ground truth} samples on the synthetic densities depicted in \Cref{fig:samples}. \textbf{(d):} Mean $\log_{10}$ MMD across all three sampling tasks \wrt the \textit{samples obtained by gradient-based SVGD}. 
    All results are averaged across 10 independent runs ($\pm 1.96$ standard error). 
    Among all gradient-free methods, SV-CMA-ES best approximates both the ground truth samples and the results of gradient-based SVGD (blue line).
    }
    \label{fig:mmd}
\end{figure*}

\paragraph{Setting} 
We first evaluate our method on multiple synthetic densities to illustrate the quality of the generated samples. 
The closed-form density of every problem is listed in \Cref{secExpDets}.
We use a total population size of 400 for all methods, which is split across 100 particles for the ES-based algorithms. 
In other words, each particle samples an ES population of 4.
Following common practice in the literature, we quantify sampling performance by evaluating the Maximum Mean Discrepancy \citep[MMD]{gretton2012kernel} of the particles with respect to ground truth samples.
We additionally evaluate the scaling to higher particle numbers in Sec.~\ref{sec:abl}.
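For reference, a standard (biased) empirical estimator of the squared MMD between particles $\{x_i\}_{i=1}^n$ and ground truth samples $\{y_j\}_{j=1}^m$ under a kernel $k$ is sketched below. This is the textbook form following \citet{gretton2012kernel}; the symbols $n$, $m$, $x_i$, and $y_j$ are introduced here for illustration, and the exact estimator and kernel used for evaluation may differ from this sketch.
\begin{equation*}
    \widehat{\mathrm{MMD}}^2 = \frac{1}{n^2} \sum_{i, i' = 1}^{n} k(x_i, x_{i'}) - \frac{2}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} k(x_i, y_j) + \frac{1}{m^2} \sum_{j, j' = 1}^{m} k(y_j, y_{j'}).
\end{equation*}
The estimate is zero when both sample sets coincide, and smaller values indicate a closer match between the particles and the ground truth samples.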
% For this analysis, we report the final performances for different numbers of particles, as well as population splits for the ES-based methods.

\paragraph{Results} 
Figures \ref{fig:samples} and \ref{fig:mmd} display the qualitative and quantitative sampling results.
As expected, $\N$-SVGD generates high-quality samples for all problems.
We find that among the gradient-free methods, SV-CMA-ES performs the best across all problems.
While GF-SVGD generates high-quality samples for the Gaussian mixture, the variance of the generated samples on the double banana density is too high, and the samples for the motion planning problem are of poor quality.
In contrast, SV-OpenAI-ES performs well on the motion planning problem, but on the others it either underestimates the variance (Gaussian mixture) or converges slowly (double banana; see also Fig.~\ref{fig:banana_convergence_iter}).
These results highlight the fast convergence properties of our method, as it employs the automatic step-size adaptation of CMA-ES.
Additionally, \Cref{fig:mmd}~(d) displays the MMD \wrt the samples that were obtained by $\N$-SVGD, aggregated across all sampling tasks.
Our results demonstrate that SV-CMA-ES can indeed quickly converge to a set of samples that approximates the outcomes of $\N$-SVGD well, and better than the two other gradient-free baselines.
We further illustrate these results in \cref{sec:app_full} where we display the sample sets for all sampling tasks.
Moreover, we illustrate the benefit of using the presented algorithm compared to other CMA-ES-based methods in Fig.~\ref{fig:sves-overview} -- since prior CMA-ES methods only maximize likelihood, the diversity of samples is low.

\subsection{Bayesian Logistic Regression}
\paragraph{Setting} Next, we evaluate our method on Bayesian logistic regression for binary classification.
We follow the setup of \citet{langosco2021neural}, which uses a hierarchical prior $p(\theta)$ on the parameters $\theta = [\alpha, \beta]$, where $\beta \sim \NN (0, \alpha^{-1})$ and $\alpha \sim \Gamma (a_0, b_0)$.
Given data $D$, the task is to approximate samples from the posterior
\begin{align*}
    &p(\theta \mid D) \propto p(D \mid \theta)\,p(\theta) \quad \text{ with:}\\
    &p(D \mid \theta) = \prod_{i=1}^N \bigl[ y_i \tfrac{\exp (x_i^T \beta)}{1 + \exp (x_i^T \beta)} + (1- y_i) \tfrac{\exp (-x_i^T \beta)}{1 + \exp (-x_i^T \beta)} \bigr].
\end{align*}
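Assuming binary labels $y_i \in \{0, 1\}$, exactly one of the two summands in each likelihood factor is active, so the unnormalized log-posterior can be sketched as follows. This is a direct rewriting of the expressions above under the assumption that the hierarchical prior factorizes as $p(\theta) = p(\beta \mid \alpha)\, p(\alpha)$; the shorthand $\sigma(z) = 1 / (1 + \exp(-z))$ is introduced here for readability only.
\begin{equation*}
    \log p(\theta \mid D) = \sum_{i=1}^N \bigl[ y_i \log \sigma(x_i^T \beta) + (1 - y_i) \log \sigma(-x_i^T \beta) \bigr] + \log p(\beta \mid \alpha) + \log p(\alpha) + \text{const}.
\end{equation*}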
We consider the binary \textit{Covtype}, \textit{Spambase}, and the \textit{German credit} datasets from the UCI Machine Learning Repository \citep{asuncion2007uci}, as suggested in prior work \citep{liu2016stein, arenz2020trust, futami2018variational}. 
For all experiments, we use a total population of 256, which is split across 8 particles for the ES-based methods.

\paragraph{Results} 
For each dataset, we report the accuracy and negative log-likelihood (NLL) across the entire particle set, and report the mean performance across 10 runs.
Our results demonstrate that SV-CMA-ES outperforms the remaining gradient-free algorithms.
On all three datasets, our method converges fastest among the gradient-free methods.
Furthermore, its final performance is considerably better than GF-SVGD on all datasets.
While the performance of $\N$-SVGD is slightly better on the Covtype dataset, SV-CMA-ES is on par with it for the Spam dataset.
Additionally, on the credit data, we find that ES-based methods are both more accurate and exhibit greater stability than the gradient-based SVGD, which underlines the potential of zero-order methods in this context.

\begin{figure}[!ht]
    \centering
    \subfloat{
        \includegraphics[width=.3\linewidth]{imgs/log_reg/covtype10rep1000iter8pop32accuracy}
    }
    \subfloat{
        \includegraphics[width=.3\linewidth]{imgs/log_reg/spam10rep1000iter8pop32accuracy}
    }
    \subfloat{
        \includegraphics[width=.3\linewidth]{imgs/log_reg/credit10rep1000iter8pop32accuracy}
    }\\
    \setcounter{subfigure}{0}
    \subfloat[Covtype]{
        \includegraphics[width=.3\linewidth]{imgs/log_reg/covtype10rep1000iter8pop32test_nll}
    }
    \subfloat[Spam]{
        \includegraphics[width=.3\linewidth]{imgs/log_reg/spam10rep1000iter8pop32test_nll}
    }\subfloat[Credit]{
        \includegraphics[width=.3\linewidth]{imgs/log_reg/credit10rep1000iter8pop32test_nll}
    }
    \caption{
        Results of Bayesian logistic regression. 
        We report mean ($\pm 1.96$ standard error) across 10 independent runs. 
        SV-CMA-ES converges faster than the other gradient-free methods, and achieves performance at convergence similar to that of gradient-based SVGD (dashed line).
    }
    \label{fig:log_reg}
\end{figure}

\subsection{Reinforcement Learning}\label{secExpRl}
\paragraph{Setting} 
We further assess the performance of the gradient-free SVGD methods on six classic reinforcement learning (RL) problems.
The goal of each RL task is to maximize the expected episodic return $J(\theta)$, where each particle $\theta$ now parametrizes a multi-layer perceptron (MLP).
The corresponding inference objective is to sample policy parameters $\theta$ from the following Boltzmann distribution:
\begin{equation*}
    p(\theta) \propto \exp(J(\theta)), \quad J(\theta) = \EEE_{(s_t, a_t) \sim \pi_{\theta}} \bigl[\sum_{t=1}^T r(s_t, a_t)\bigr],
\end{equation*}
where $(s_t, a_t) \sim \pi_{\theta}$ represents a trajectory sampled from the distribution that is induced by the policy parametrized by $\theta$.
For each problem, we train an MLP with two hidden layers of 16 units each, which yields a high-dimensional optimization problem, as each MLP has several hundred parameters.
The specific numbers vary across the benchmarks and are listed in \Cref{table:hyperparams} in the Appendix.
We use a total population size of 64 which we split into 4 subpopulations for the ES-based methods and estimate the expected return across 16 rollouts with different seeds.
To make the results comparable to other works on ES for RL, we follow the approach of \citet{lee2023stamp} and extend the optimization by a phase that attempts to find exact optima.
We realize this by fading out the repulsive term via the schedule $\gamma (t) = \log (T / t)$.

\begin{figure}[ht]
    \centering
    \subfloat[Pendulum]{
        \includegraphics[width=.32\linewidth]{imgs/rl/Pendulum-v110rep200iter4pop16val_new}
    }
    \subfloat[CartPole]{
        \includegraphics[width=.3\linewidth]{imgs/rl/CartPole-v110rep200iter4pop16val_new}
    }
    \subfloat[MountainCar]{
        \includegraphics[width=.3\linewidth]{imgs/rl/MountainCarContinuous-v010rep200iter4pop16val_new}
    }\\[-.5em]
    \subfloat[Halfcheetah]{
        \includegraphics[width=.3\linewidth]{imgs/rl/halfcheetah10rep1000iter4pop16val_new}
    }
    \subfloat[Hopper]{
        \includegraphics[width=.3\linewidth]{imgs/rl/hopper10rep1000iter4pop16val_new}
    }
    \subfloat[Walker]{
        \includegraphics[width=.3\linewidth]{imgs/rl/walker2d10rep1000iter4pop16val_new}
    }
    \caption{Results of sampling MLP parameters for RL tasks. 
    Plotted is the best expected return across all particles for each method. 
    We report the mean ($\pm 1.96$ standard error) across 10 independent runs.
    SV-CMA-ES performs better than the gradient-free baselines across all tasks.
    }
    \label{fig:rl}
\end{figure}

\paragraph{Results} 
We display the aggregated results across all RL tasks in Fig.~\ref{fig:sves-overview}, and the individual task performances in Fig.~\ref{fig:rl}.
Our results showcase a strong performance of SV-CMA-ES.
In comparison to other gradient-free versions of SVGD, it is the only method that generates high scoring solutions for all problems.
In particular, we observe that SV-CMA-ES is the only method that solves the MountainCar problem consistently, while it is the fastest to converge on Pendulum.
Both of these environments feature a local optimum at which agents remain idle to avoid control costs \citep{eberhard2023pink}.
It is on these problems that GF-SVGD converges to such optima in certain runs, which we further illustrate in Fig.\ \ref{fig:mc_seeds} in the Appendix.
These results illustrate that SV-CMA-ES improves over GF-SVGD by sampling stochastic ES steps, which leads to a higher exploration of the domain.
Interestingly, our results further show that SV-OpenAI-ES may deliver good samples in some runs, but the high standard error on several problems underlines its sensitivity to initialization.
These findings confirm that SV-CMA-ES is a strong gradient-free SVGD scheme, capable of sampling from densities and optimizing blackbox objectives.
Further, we would like to note that our final performances are comparable to those reported in prior gradient-based work \citep{jesson2024relu}, which again underlines the potential of our method.

Further, we analyze the benefits of the kernel term by comparing our method to uncoordinated parallel runs of CMA-ES.
Overall, we observe a clear performance improvement when using the kernel term.
In particular, in the more challenging Hopper and Walker tasks, the benefits of using SV-CMA-ES over parallel CMA-ES are large.
We extend our analysis of SV-CMA-ES in Appendix \ref{secAblations} where we compare it to vanilla CMA-ES and OpenAI-ES, and conduct additional experiments on sparse reward environments.
This analysis reveals that SV-CMA-ES consistently outperforms competing ES, underscoring its superior performance in environments where effective exploration is essential.


\subsection{Ablation Studies}\label{sec:abl}
\paragraph{Choice of Population Size}
In the experiments above, we investigate the performance for fixed particle numbers and population sizes.
To gain further insights into the scalability of our method, we conduct an additional analysis on the same sampling problems as in \Cref{secExpSynth} using varying particle numbers.
The results of these experiments are displayed in \Cref{fig:scaling}.
In addition to the MMD after $1\,000$ iterations, we report the error when estimating the mean and variance of the target distribution from the generated samples.
We observe the clear trend that SV-CMA-ES performs better than GF-SVGD and SV-OpenAI-ES with increasing particle numbers.
Furthermore, we observe in Fig.~\ref{fig:scaling} (d) that SV-CMA-ES requires fewer samples than SV-OpenAI-ES to estimate good steps.

\paragraph{Choice of Annealing Schedule}
In the experiments above, we used annealed SVGD. 
This decision was made due to its widespread use in the community and many desirable properties. 
However, to assess the quality of our method, it is important to consider the sensitivity to the choice of annealing schedule.
In Table~\ref{table:annealing}, we show the key performance metrics from 10 seeds for SV-CMA-ES with and without annealing.
As we see, performance in all cases is strong, with little difference between the two conditions.

\begin{figure*}[ht]
    \centering
    \subfloat[MMD w.r.t.\ ground truth]{
        \includegraphics[width=0.19\linewidth]{imgs/scaling/scaling_aggregated_lines_mmds}
    }\hfill
    \subfloat[{$\EEE [x]$}]{
        \includegraphics[width=.19\linewidth]{imgs/scaling/scaling_aggregated_lines_means}
    }\hfill
    \subfloat[{$\VVV [x]$}]{
        \includegraphics[width=.19\linewidth]{imgs/scaling/scaling_aggregated_lines_vars}
    }\hfill
    \subfloat[Subpopulation Scaling]{
        \raisebox{.1cm}{\includegraphics[width=.29\linewidth]{imgs/scaling/aggregated_grid_mmds_mean_abc}}
    }\hfill
    \caption{Scaling analysis. Depicted are the final performances for different total population sizes. \textbf{(a)}: MMD vs.\ sample size after 1000 iterations. \textbf{(b)-(c)}: MSE vs.\ sample size when estimating the mean and variance of the ground truth distribution. 
    For ES, we use the same subpop.\ size per particle as in \Cref{fig:mmd}. \textbf{(d)}: Subpopulation size scaling for ES-based SVGD. 
    The results are averaged across 10 independent runs of all synthetic sampling tasks from \Cref{fig:mmd}.
    SV-CMA-ES performs the best out of all gradient-free methods (solid lines) across different particle numbers.
    }
    \label{fig:scaling}
\end{figure*}

\begin{table}[h]
\centering
\resizebox{\columnwidth}{!}{%
    \begin{tabular}{l|ll}
    \toprule
    \textbf{Task} & \textbf{Annealing} & \textbf{No Annealing} \\
    \midrule
    GMM & -3.03 (0.30) & -2.92 (0.26) \\
    Double Banana & -2.59 (0.16) & -2.83 (0.21) \\
    Motion Planning & -2.40 (0.10) & -2.44 (0.13) \\
    Covtype NLL & 0.59 (0.01) & 0.59 (0.01) \\
    MountainCar RL & 93.68 (0.10) & 93.68 (0.08) \\
    Hopper RL & 1781.31 (132.64) & 1788.13 (79.22) \\
    \bottomrule
    \end{tabular}
}
\caption{Kernel annealing ablation. This table shows SV-CMA-ES performances across 10 seeds with 1.96 standard error in parentheses. All runs use the identical setup, aside from the kernel annealing. No annealing means that we use a constant $\gamma(t)=1$.}\label{table:annealing}
\end{table}
