% \documentclass{uai2025} % for initial submission
\documentclass[accepted]{uai2025} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2025} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2025} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

% My packages
\usepackage{algorithm}
\usepackage{algpseudocode}

\usepackage{tabularx}
\usepackage{multirow}

\newcommand{\hau}[1]{\textcolor{blue}{[Hau: #1]}}
%% Self-defined macros
\input{commands}



\title{
Geodesic Slice Sampler for Multimodal Distributions with Strong Curvature}


% The standard author block has changed for UAI 2025 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<bernardo.williamsmoreno@helsinki.fi>?Subject=Geodesic Slice Sampler for Multimodal Distributions with Strong Curvature}{Bernardo Williams}{}}
\author[1]{Hanlin Yu}
\author[1]{Hoang Phuc Hau Luu}
\author[2]{Georgios Arvanitidis}
\author[1]{Arto Klami}
% Add affiliations after the authors
\affil[1]{%    
    Department of Computer Science, University of Helsinki, Finland    
}
\affil[2]{%
    Cognitive Systems, DTU Compute, Technical University of Denmark
}
  
\begin{document}

% Do not add contents to table of contents
\addtocontents{toc}{\protect\setcounter{tocdepth}{0}}

    \maketitle
    
\begin{abstract}    
Traditional Markov Chain Monte Carlo sampling methods often struggle with sharp curvatures, intricate geometries, and multimodal distributions. Slice sampling can resolve local exploration inefficiency issues, and Riemannian geometries help with sharp curvatures. Recent extensions enable slice sampling on Riemannian manifolds, but they are restricted to cases where geodesics are available in closed form. We propose a method that generalizes Hit-and-Run slice sampling to more general geometries tailored to the target distribution, by approximating geodesics as solutions to differential equations. Our approach enables the exploration of regions with strong curvature and rapid transitions between modes in multimodal distributions. We demonstrate the advantages of the approach on challenging sampling problems.
\end{abstract}

\section{Introduction}\label{sec:intro}

Sampling from a differentiable unnormalized log-density defined on a Euclidean space is a core problem in machine learning and statistics. While gradient-based Markov Chain Monte Carlo (MCMC) methods have proven effective in many scenarios, they often face significant challenges when the target distribution exhibits complex geometry (sharp curvature) or multimodal behavior. The two core challenges are largely addressed with complementary techniques, with little work on algorithms that excel for targets that are \emph{both} multimodal and complex in shape.

Complex shapes and sharp curvatures are often addressed by using a suitably chosen Riemannian geometry within the sampling algorithms \citep{Girolami2011}. Instead of operating in a Euclidean space and metric, the samplers carry out the necessary operations using a metric that adapts to the curvature of the parameter space. In practice, the methods follow flows induced by the metric, in most cases by numerical integration, and consequently the methods are sometimes called \emph{geodesic} methods as in our title. Various practical metrics and sampling algorithms have been shown to improve the sampling of targets with strong curvature \citep{Girolami2011,Byrne2013,Lan2015,Hartmann2022,Hartmann2023,Williams2024}, albeit always with increased computational cost.

Multimodality, in turn, is most commonly addressed by tempering or diffusion techniques \citep{Earl2005,Chen2024}. These methods use a tempered (smoothed) version of the target to improve exploration over multiple modes, intuitively changing the problem itself so that the modes are connected with areas of sufficient probability. At a high degree of tempering these methods can efficiently explore the different modes, but low tempering is needed for accurate sampling within the modes, necessitating adaptive or parallel sampling with different degrees of tempering.
The efficiency of parallel tempering depends on the swap acceptance rate between adjacent temperatures, which can decrease in high dimensions if the temperature schedule is not well-tuned \citep{Woodard2009}.  
Diffusion-based approaches, in turn,  require careful choice of the noise schedule to balance exploration and accuracy \citep{Song2019, Chen2024}. Unlike tempering, diffusion methods can achieve smooth transitions between modes without explicitly maintaining a set of parallel chains, but the acceptance rate of noisy samples can be low \citep{Chen2024}.

Even though the two approaches are efficient in addressing the two challenges separately, there is very little work on samplers designed for the general setup where both difficulties may arise simultaneously. One could consider e.g. parallel tempering in a Riemannian metric --- see \citet{Byrne2013} for a rare example in this intersection --- but ideally we would like to address both aspects using a common mechanism. This work explores one such approach, developing a Riemannian sampler capable of efficiently exploring multiple modes, without any tempering for the target distribution. Instead, we seek to improve mode exploration by changing  the metric, in the spirit of the early work by \citet{Lan2014} that developed a specific metric solely for this purpose. Their metric, however, requires explicit identification and tracking of the modes and is more like a conceptual demonstration, and we are not aware of any other works aiming for efficient multimodal samplers solely by the change of the metric.

% We are motivated by the idealized slice sampler, with computable level sets.
% The performance of such an idealized sampler is theoretically independent of the sampling problem, as pointed out by \citet{Durmus2023}:
% \textsl{“This means that the performance of the idealized slice sampler is ignorant of the introduction of, e.g., multimodality, local modes, or anisotropy as long as the volume of the level sets is not modified.”}
% If we change the geometry of the problem such that level sets are easier to compute, then the slice sampler should benefit.
We are motivated by the idealized slice sampler with computable level sets. As noted by \citet{Durmus2023}:
\textsl{“This means that the performance of the idealized slice sampler is ignorant of the introduction of, e.g., multimodality, local modes, or anisotropy as long as the volume of the level sets is not modified.”}
This insight suggests that by modifying the geometry of the problem to produce simpler or more tractable level sets, the slice sampler can effectively handle multimodal distributions.
%
From a practical perspective, we build on the (Euclidean) Hit-and-Run slice sampler by \citet{Belisle1993}, which at each iteration selects a random direction and then samples from the resulting one-dimensional distribution formed by the intersection of the line and the slice. 
In effect, it transforms multi-dimensional sampling into a sequence of one-dimensional sampling tasks, but the overall sampler may be inefficient: especially in higher dimensions, the intersection with the slice can be small for almost all directions \citep{Murray2010}.

Both \citet{Habeck2023} and \citet{Durmus2023} recently considered generalizations of the Hit-and-Run sampler for Riemannian manifolds, replacing the lines with geodesics. We build on the general algorithmic framework introduced by \citet{Durmus2023} and adapt it to the task of sampling from a distribution with a complex geometry. Specifically, we begin by embedding the (Euclidean) sampling space into a higher-dimensional space that incorporates the target distribution’s geometric information, such as Fisher information or Monge embedding \citep{Hartmann2022}. This transforms the problem into sampling from a particular Riemannian manifold where the target distribution corresponds to the Hausdorff density (see Section \ref{sect:approx_geo}). 
Note that even though we leverage components proposed by \citet{Durmus2023}, our task is fundamentally more difficult. Their starting point was sampling of a density on a known manifold (e.g., a sphere) where the geodesics are  exactly known, whereas 
the complexity of our embedding manifold requires us to approximate the geodesics using numerical integrators.

\begin{figure}[t]
    \centering
    \includegraphics[width=\linewidth]{figs/figure1/euclidean_.png}
    \includegraphics[width=\linewidth]{figs/figure1/inverse_generative_1.0.png}
    \includegraphics[width=\linewidth]{figs/figure1/inverse_generative_0.1.png}
    \includegraphics[width=\linewidth]{figs/figure1/inverse_generative_0.01.png}
    \caption{Illustration of the step-out procedure in \ournamefull.
    The lines drawn with different colors represent the Hausdorff density $p(t) := p_{\cH}(\hat{\gamma}_{(\bx, \bbv)}(t))$ (Eq.~\eqref{eq:hausdensity}) considering the Inverse Generative metric for different values of $\lambda$ (Eq.~\eqref{eq:gen}).
    The step-out procedure draws $s \sim \mathrm{Unif}(0,p(0))$ and randomly places an initial interval of width $w$ around $t=0$, with left and right endpoints $\ell$ and $r$. While $p(r) > s$ it expands the right side of the interval as $r = r + w$, and while $p(\ell) > s$ it expands the left side as $\ell = \ell - w$.
    As $\lambda \rightarrow 0$ the space shrinks due to the properties of the metric, making it easier for the step-out procedure to reach the distant mode.
    }
    \label{fig:fig1}
\end{figure}

In this work, we propose a geodesic slice sampler applicable to arbitrary Riemannian metrics, and discuss the choice of the metric. In particular, we introduce two new computationally efficient metrics. Both metrics improve sampling over multimodal targets by, in a sense, pulling the modes closer to each other; see Figure~\ref{fig:fig1}, which illustrates this effect within the slice sampler as a function of a parameter $\lambda$ controlling how much the metric warps the space.
% see Figure~\ref{fig:fig1} and Figure~\ref{fig:fig2} illustrating this effect. The former within the slice sampler in a single dimensional space, as a function of a parameter $\lambda$ controlling how much the metric warps the space. And the latter in a two dimensional space.
In addition, we introduce a meta-sampler similar to \citet{Tjelmeland2001} that combines the proposed method with a separate sampler for improved exploration of local modes.

We empirically demonstrate improved sampling over Euclidean methods for complex targets, and highlight improved mixing over multiple modes in high-dimensional cases when compared against parallel tempering \citep{Swendsen1986,Latuszynski2025} and the diffusive Gibbs sampler by \citet{Chen2024} designed for addressing multimodality. As with previous Riemannian methods, the algorithm shows good exploration and mixing, but has slower iterations because of the numerical computation of the geodesics.

\section{Background: Slice sampling}

The classic work of \citet{Neal2003} introduces slice sampling as a method for generating samples by sampling uniformly from the region of $\R^{D+1}$ that lies under the graph of the probability density.
%
Let $p(\bx)$ be an unnormalized continuous target density that satisfies $\int p(\bx) \dbx < \infty$. Suppose that direct sampling from $p(\bx)$ is not feasible.
We consider densities where $\bx \in \R^D$ with respect to the Lebesgue measure.

Idealized slice sampling defines a uniform distribution over the volume under the graph of $p(\bx)$ and generates samples through the following two steps: 
\begin{enumerate}
    \item Sample $s \sim \mathrm{Unif}(0, p(\bx))$.
    \item Sample $\bx \sim \mathrm{Unif}( L(s) )$.
\end{enumerate}
where the slice is given by $L(s) := \{ \bx \mid p(\bx) > s \}$. For special cases, such as log-concave or rotationally invariant densities, the slice sampler has theoretical performance guarantees \citep{Natarovskii2021}. However, for more complex distributions, drawing uniform samples from $L(s)$ is often impractical \citep{Rudolf2018}.
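
To see why the two steps above target $p$, a one-line check may help; the normalizing constant $Z = \int p(\bx) \dbx$ and the indicator notation $\mathbf{1}\{\cdot\}$ are introduced here only for illustration. The uniform distribution over the region under the graph has joint density $\mathbf{1}\{0 < s < p(\bx)\}/Z$, whose marginal in $\bx$ is
\begin{equation*}
\int_0^{\infty} \frac{\mathbf{1}\{0 < s < p(\bx)\}}{Z} \,\dd s = \frac{p(\bx)}{Z},
\end{equation*}
and the two steps are the conditional distributions of $s$ given $\bx$ and of $\bx$ given $s$ under this joint density. The practical difficulty therefore lies entirely in the second step, the uniform draw from $L(s)$.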

To address this, the step-out and shrinkage procedures are used. Below, we provide an informal explanation of these procedures. The full algorithms are detailed in the Appendix (Algorithms~\ref{alg:stepout} and \ref{alg:shrink}).
%
Both procedures were first introduced by \citet{Neal2003}, but we adopt an equally valid modified version of the shrinkage step as proposed by \citet{Durmus2023}.
%Both versions are equally valid. 
For a moment, assume a univariate density $p(x)$ and a current position $x \in \R$. The procedures are as follows:


\paragraph{The Step-Out Procedure}  
The step-out procedure, illustrated in Figure~\ref{fig:fig1}, takes two parameters: the width $w > 0$ and the maximum number of steps $m \in \N$. Given the slice $L(s)$, the goal is to expand an interval around the current point $x$. Consider the auxiliary function $\gamma_{x}(t)=x + t$.

The initial left and right endpoints $\ell$ and $r$ are placed a fixed distance $w$ apart, at a random offset around $t=0$. This is done by sampling $u \sim \mathrm{Unif}(0, w)$ and setting $\ell = -u$ and $r = \ell + w$. To ensure that at most $m+1$ expansion steps are performed (combined for both directions), a random integer $\iota \sim \mathrm{Unif}(\{1, \dots, m\})$ is sampled. The right limit is expanded up to $\iota$ times, and the left limit up to $m+1-\iota$ times.

The expansion proceeds as follows:
The right limit $r$ is expanded by adding $w$ for as long as $p(\gamma_x(r)) > s$, meaning $\gamma_x(r) \in L(s)$, and its step budget is not exhausted. The left limit $\ell$ is expanded by subtracting $w$ for as long as $p(\gamma_x(\ell)) > s$, meaning $\gamma_x(\ell) \in L(s)$, and its step budget is not exhausted.
The procedure returns the updated interval $(\ell, r)$. We denote it by $\text{Step-out}_{w,m}(s, \gamma_x)$.

\paragraph{The Shrinkage Procedure}  
The shrinkage procedure selects a sample from the interval $(\ell, r)$ by gradually reducing its size until a point is found within $L(s) \cap (\ell, r)$.  

The interval $J = (\ell, r)$ is treated as a circular domain, meaning that if we move past $r$, we continue from $\ell$. The procedure starts by sampling two points $y$ and $z$ uniformly within $(\ell, r)$. If neither $\gamma_x(y)$ nor $\gamma_x(z)$ falls inside $L(s)$, the interval is shrunk as follows:
\begin{itemize}
    \item Form the interval $(y\wedge z, y\vee z )$. Update the circular region by
    \begin{equation*}
        J =
        \begin{cases} 
        J\cap(y \wedge z, y \vee z), & \text{if } 0 \in (y \wedge z, y \vee z), \\
        J \setminus (y \wedge z, y \vee z), & \text{if } 0 \notin (y \wedge z, y \vee z).
        \end{cases}        
    \end{equation*}    
    \item Set $y=z$ and  update $z\sim \mathrm{Unif}(J)$.  
    \item This process repeats, each time reducing the size of the interval, until $\gamma_x(z) \in L(s)$.
\end{itemize}
We denote the procedure by $\text{Shrink}_{\ell,r}(s, \gamma_x)$.
One complete step of the slice sampler is:
\begin{enumerate}
    \item Sample $s \sim \mathrm{Unif}(0,p(x))$.
    \item Obtain $(\ell,r) = \text{Step-out}_{w,m}(s,\gamma_x)$.
    \item Sample $t^* = \text{Shrink}_{\ell,r}(s, \gamma_x)$.
    \item Set $x = \gamma_x(t^*)$.
\end{enumerate}


\paragraph{Hit-and-Run} 
One way to extend slice sampling to multivariate distributions is to combine it with Hit-and-Run sampling,
 presented here following \citet{Belisle1993}. Let $ \dS^{D-1}(\bx) = \{\bbv \in \R^D : \norm{\bbv}^2 =1\}$ denote the Euclidean unit sphere of directions at $\bx$. An iteration of the whole sampler is:
\begin{itemize}
    \item Obtain $\bbv\sim \mathrm{Unif}( \dS^{D-1}(\bx) )$.
    \item Obtain a sample from the density evaluated on the straight line (Euclidean geodesic) 
    \begin{equation*}
     t \mapsto p(\bx + t \bbv)/ \int p(\bx + t \bbv) \dd t.
    \end{equation*}
    
    
\end{itemize}
When directly sampling a value $t$ according to the density along the straight line, $t \mapsto  p(\bx + t \bbv)/ \int p(\bx + t \bbv) \dd t$, is not feasible, we can use slice sampling on $t \mapsto  p(\bx + t \bbv)$, since it is an unnormalized univariate density. Define $\gamma_{(\bx,\bbv)}(t) = \bx + t \bbv$.
Given a level $s \sim \mathrm{Unif}(0, p(\bx))$, the step-out procedure outputs $(\ell,r) = \text{Step-out}_{w,m}(s, \gamma_{(\bx,\bbv)})$ and
the shrinkage procedure returns $t^* = \text{Shrink}_{\ell,r}(s, \gamma_{(\bx,\bbv)})$. The new sample is $\bx =  \gamma_{(\bx,\bbv)}(t^*)$.
This is called  Hit-and-Run slice sampling or hybrid slice sampling \citep{Latuszynski2014}. This method extends slice sampling to probability distributions defined over $\R^D$.


\section{Method}
%\ournamefull}
\label{sect:approx_geo}
Our main contribution is a geodesic slice sampler that can accommodate arbitrary metrics. It extends the Hit-and-Run slice sampler described above to non-Euclidean geometries, similar to the recent works of \citet{Durmus2023} and \citet{Habeck2023}, but instead of leveraging closed-form geodesics of predefined manifolds we induce metrics from characteristics of the target density itself to guide the sampling. As a consequence, the geodesics need to be approximated by numerical integrators.
This section explains the sampler for general metrics, always using $\bG(\bx)$ to denote the metric tensor, as well as a meta-sampler that combines the core method with a separate local sampler for improved efficiency. We will discuss specific metrics in Section~\ref{sec:metrics}.


Straight-line Hit-and-Run sampling can be inefficient because proposals often move away from high-probability regions \citep{Murray2010}. To resolve this, we perform slice sampling along geodesic curves that can accommodate the geometry of the target distribution. This improves efficiency when the target distribution is highly curved or multimodal; see Section~\ref{sec:metrics_global}. 
We are interested in sampling problems defined in $\mathbb{R}^D$ but allow using different plug-in metrics (preferably using the target density information) to enhance exploration. 
% This can be cast as sampling from a distribution defined on a Riemannian manifold where the algorithm in \citet{Durmus2023} can be employed. 

The general problem can be cast as sampling from a distribution defined on a Riemannian manifold, for which we adapt the algorithm of \citet{Durmus2023} to general metrics.
Alternatively, it can be seen as Hit-and-Run where straight lines are replaced by curves that better wrap around the level sets of the target density (provided the metric is well chosen). Because the metric is general, closed-form geodesics are unavailable, so we must compute them with numerical integrators.
To correctly sample along geodesics, we need three key components: adjusting for the correct density on the manifold, properly sampling directions using the Riemannian metric, and solving the geodesic equations.

\paragraph{Hausdorff Density: } 
To ensure we sample from the correct distribution on the manifold with metric $\bG(\bx)$, we must account for the change in measure from the Euclidean space to the manifold. The correct density is the Hausdorff density
\begin{equation} 
  p_{\cH}(\bx) = \frac{p(\bx)}{\sqrt{\det \bG(\bx)}}. \label{eq:hausdensity}
\end{equation}
The denominator adjusts for the local volume distortion introduced by the metric $\bG(\bx)$, so that probabilities computed on the manifold agree with those of the original target. See Appendix~\ref{app:hausdorff} for further details.
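
As a brief check of this normalization, a sketch assuming only that the Riemannian volume element induced by $\bG(\bx)$ is $\sqrt{\det \bG(\bx)}\,\dbx$: for any measurable set $A \subseteq \R^D$ (a symbol introduced here for illustration),
\begin{equation*}
\int_A p_{\cH}(\bx) \sqrt{\det \bG(\bx)} \,\dbx
= \int_A \frac{p(\bx)}{\sqrt{\det \bG(\bx)}} \sqrt{\det \bG(\bx)} \,\dbx
= \int_A p(\bx) \,\dbx ,
\end{equation*}
so sampling $p_{\cH}$ with respect to the Riemannian volume measure assigns to $A$ the same unnormalized mass as sampling $p$ with respect to the Lebesgue measure.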

\paragraph{Sampling from the Riemannian Unit Sphere: } 
%Here we explain the mechanism found in \citep{Durmus2023}.
Instead of sampling a random direction in Euclidean space, we must now sample a direction from the unit sphere of the Riemannian metric at the current point, for which we can directly use the method proposed by \citet{Durmus2023}. Given a position $\bx$, a velocity $\bbv$ is sampled as follows: first draw $\bbv \sim \mathcal{N}(\mathbf{0}, \bG^{-1}(\bx))$, and then normalize it to obtain a unit-length vector in the Riemannian metric:
\begin{equation*}
\bbv \leftarrow \frac{\bbv}{\|\bbv\|_g}, \quad \text{where} \quad \|\bbv\|_g = \sqrt{\bbv^\top \bG(\bx) \bbv}.    
\end{equation*}
This ensures that the direction is uniformly distributed on the unit sphere under the metric $\bG(\bx)$. See Appendix \ref{app:agss} for additional implementation details.
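
To see why this normalization yields a uniform direction, the following short calculation may help. It assumes that $\bG(\bx)$ is symmetric positive definite; the symbols $\mathbf{z}$ and $\bG^{-1/2}(\bx)$ (the symmetric inverse square root) are introduced here only for illustration. Writing $\bbv = \bG^{-1/2}(\bx)\,\mathbf{z}$ with $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \bI_D)$ gives $\bbv \sim \mathcal{N}(\mathbf{0}, \bG^{-1}(\bx))$ and
\begin{equation*}
\|\bbv\|_g^2 = \mathbf{z}^\top \bG^{-1/2}(\bx)\, \bG(\bx)\, \bG^{-1/2}(\bx)\, \mathbf{z} = \mathbf{z}^\top \mathbf{z},
\qquad \text{so} \qquad
\frac{\bbv}{\|\bbv\|_g} = \bG^{-1/2}(\bx)\, \frac{\mathbf{z}}{\|\mathbf{z}\|_2}.
\end{equation*}
Since $\mathbf{z}/\|\mathbf{z}\|_2$ is uniform on the Euclidean unit sphere and the linear map $\bG^{-1/2}(\bx)$ carries the Euclidean norm to $\|\cdot\|_g$, the normalized velocity is the isometric image of a uniform direction and lies on $\{\bbv : \bbv^\top \bG(\bx)\, \bbv = 1\}$.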

\paragraph{Approximating Geodesic Curves}
Given a sampled velocity $\bbv$, we need to follow the geodesic curve starting at $\bx$ in direction $\bbv$. In general, the geodesic equation
\begin{align}
    \dot \bx_k &= \bbv_k, \nonumber \\
    \dot \bbv_k &= - \norm{\bbv}^2_{\Gamma^k}, \quad \mathrm{for}\ k = 1, \ldots, D, \label{eq:geoeqs}
\end{align}
where $\norm{\bbv}^2_{\Gamma^k} = \sum_{i,j} \Gamma^k_{ij} \bbv_i \bbv_j$ and $\Gamma^k_{ij} = \tfrac{1}{2}g^{km} ( \partial_i g_{m j} +  \partial_j g_{i m} - \partial_m g_{i j})$, with $g_{ij}$ and $g^{ij}$ denoting the entries of $\bG(\bx)$ and $\bG^{-1}(\bx)$ and summation over the repeated index $m$ implied,
does not have a closed-form solution for arbitrary $\bG(\bx)$; see Appendix~\ref{app:geoeq} for more detail.
Instead, we approximate the exponential map $\gamma_{(\bx, \bbv)}(t)$ numerically by solving these differential equations with an ordinary differential equation (ODE) solver, and denote the numerical solution by $\hat{\gamma}_{(\bx, \bbv)}(t)$.
The choice of the metric determines the shape of geodesic trajectories, allowing the sampler to adapt to different target distributions; see Section~\ref{sec:metrics}.

Algorithm \ref{alg:agss} explains the full \ournamefull \ (\ourmethod). After sampling a velocity $\bbv$ from the unit Riemannian sphere, slice sampling is performed on the Hausdorff density evaluated along the numerical solution of the geodesic trajectory. The step-out and shrinkage procedures then determine the final sample.
\begin{algorithm}[H] 
    \caption{\ournamefull}
    \label{alg:agss}
    \textbf{Input:} Initial position $\bx^{[0]}$, metric tensor $\bG(\bx)$, and parameters $m\in \mathbb{N}$, $w > 0$.\\
    \textbf{Output:} $N$ samples $\bx^{[n]}$.
    \begin{algorithmic}[1]                        
        \For{$n \leftarrow 0, \dots, N-1$}
            \State Sample $s \sim \mathrm{Unif}(0, p_{\cH}(\bx^{[n]}))$
            \State Sample velocity $\bbv^{[n]} \sim \textrm{Unif}(\dS_g^{D-1}(\bx^{[n]}))$      
            \State Compute  $(\ell, r) = \text{Step-out}_{w,m}(s, \hat{\gamma}_{(\bx^{[n]}, \bbv^{[n]})})$ 
            \State Sample time $t^* = \text{Shrink}_{\ell,r}(s, \hat{\gamma}_{(\bx^{[n]}, \bbv^{[n]})} )$
            \State $\bx^{[n+1]}= \hat{\gamma}_{(\bx^{[n]},\bbv^{[n]})}(t^*)$
        \EndFor
    \end{algorithmic} 
\end{algorithm}

\subsection{Meta Sampler and Multimodality}

The sampler as described above is valid as such, but we also introduce a simple extension that can further improve sampling for multimodal targets with complex local structure.

Following \citet{Tjelmeland2001,Latuszynski2025}, we create a \emph{meta-sampler} that alternates between using \ourmethod\ for global moves and an arbitrary local MCMC for sampling within each mode.
To generate one sample, we first run $K$ steps of \ourmethod\ followed by $L$ steps of any local MCMC sampler.
We refer to this combined strategy as \metaourmethod, detailed in Algorithm~\ref{alg:metaagss} (in the Appendix).
%
The main motivation for this hybrid strategy is to leverage gradient-based algorithms for fast exploration within a mode, benefiting from their efficient mixing and cheap iterations whenever they are adequate for the local structure of the target. We could in principle use any sampler for the local part, including Riemannian samplers, but in practice we use the standard Euclidean Metropolis-adjusted Langevin algorithm (MALA) \citep{Roberts1996} in our experiments.


%\subsection{Detailed Balance}

% The detailed balance follows from  \citet{Durmus2023}, but additional care is needed because the geodesics need to be computed numerically. We state the key theorem below, with the proof and additional discussion in Appendix~\ref{app:proof}.

% %However, unlike \citet{Durmus2023}, which focuses on the case where the geodesics are given in closed-form, in our case the geodesics have to be obtained through numerical integrations.
% \mainthm \label{thm:detailed_balance}
% %In appendix \ref{app:proof} we discuss the validity of our method.
 


\section{Metrics} \label{sec:metrics}

The sampler is general: it is applicable to an arbitrary metric and only requires $\bG(\bx)$ to be positive definite and to vary continuously. By selecting an appropriate metric we can influence how the geodesics explore the space, controlling the overall sampling behavior. There is no single metric that is optimal for all targets, and the metrics proposed in the literature are motivated by complementary arguments, with notable emphasis on computational efficiency.

Next we discuss the choice of metric.
The literature has focused almost exclusively on metrics that improve local exploration for complex target distributions, with several practical solutions that we recap in Section~\ref{sec:metrics_local}. We then turn our attention to improving the exploration of multiple modes, presenting novel metrics specifically designed for this purpose in Section~\ref{sec:metrics_global}.

\subsection{For Adapting to Local Curvature}
\label{sec:metrics_local}

\paragraph{The Fisher metric}
The Fisher Information Metric (FIM) is defined as the covariance of the score function, and was predominantly used in the early Riemannian methods \citep{Girolami2011} due to its close connection to estimation theory. A general form of the metric is:
%
\begin{equation*}
\bG_F(\bx)  = \Ex_{\by|\bx}\left[ \nabla_{\bx} \log p(\by | \bx)  \nabla_{\bx} \log p(\by | \bx)  ^\top \right],
\end{equation*}
%
but the specific form depends on the underlying problem, because of the integration over the conditional density. Furthermore, the inverse $\bG_F^{-1}(\bx)$ needed in the geodesic computations (Eq.~\eqref{eq:geoeqs}) must be obtained by direct matrix inversion, at a cost of $\mathcal{O}(D^3)$. This makes the metric impractical for high-dimensional problems.

\paragraph{The Monge Metric}
The computational cost of solving the geodesic equations (Eq.~\eqref{eq:geoeqs}) is primarily determined by the inversion of the metric tensor, and consequently metrics with a closed-form inverse offer significant savings. The Monge metric by \citet{Hartmann2022} arises naturally from the geometry of the graph of the log-density function when viewed as a submanifold embedded in $\mathbb{R}^{D+1}$. Let $\alpha^2 \geq 0$. The Monge metric and its inverse are given by
% 
\begin{align}
    \label{eq:monge}
    \bG_M(\bx) &= 
    \bI_D
    + \alpha^2 \nabla \ell \nabla \ell^\top,  
    \nonumber \\
    \bG_M^{-1}(\bx) &= 
    \bI_D 
    - \frac{\alpha^2}{1+\alpha^2 \|\nabla \ell\|^2} \nabla \ell \nabla \ell^\top,
\end{align}
% 
where $\ell(\bx)=\ln p(\bx)$ denotes the log-density. As $\alpha^2 \to 0$, the metric reduces to the Euclidean metric $\bI_D$. The determinant required for computing the Hausdorff density (Eq.~\eqref{eq:hausdensity}) is $\det \bG_M(\bx) = 1 + \alpha^2 \|\nabla \ell\|^2$.
Figure~\ref{fig:geodesics} illustrates the exponential map of geodesic balls with increasing radius under the Monge metric. This metric adapts to the geometry of the target distribution, expanding regions based on the local structure of the density.
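
The closed-form inverse and determinant can be checked directly. As a short verification, writing $\mathbf{g} = \nabla \ell$ and $c = \alpha^2/(1+\alpha^2\|\mathbf{g}\|^2)$ purely as shorthand introduced here,
\begin{align*}
\bG_M \bG_M^{-1}
&= \left(\bI_D + \alpha^2 \mathbf{g}\mathbf{g}^\top\right)\left(\bI_D - c\, \mathbf{g}\mathbf{g}^\top\right) \\
&= \bI_D + \left(\alpha^2 - c - \alpha^2 c\, \|\mathbf{g}\|^2\right) \mathbf{g}\mathbf{g}^\top
= \bI_D,
\end{align*}
because $c\,(1+\alpha^2\|\mathbf{g}\|^2) = \alpha^2$. This is an instance of the Sherman--Morrison formula for rank-one updates, and the determinant follows from the matrix determinant lemma, $\det(\bI_D + \alpha^2 \mathbf{g}\mathbf{g}^\top) = 1 + \alpha^2 \mathbf{g}^\top \mathbf{g}$. Once $\nabla \ell$ is available, applying $\bG_M^{-1}$ to a vector and evaluating the determinant therefore each cost $\mathcal{O}(D)$.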

\paragraph{The Generative Metric}
Another efficient metric, the Generative metric, which rescales the Euclidean metric by a factor that depends on the target density, was recently proposed by \citet{Kim2024}. One of its advantages is that computing the Christoffel symbols $\Gamma^k_{ij}$ only requires first-order derivatives of the density, whereas the Monge metric (Eq.~\eqref{eq:monge}) introduces second-order terms. For scalars $p_0 > 0$ and $\lambda \geq 0$, the Generative metric and its inverse are:
\begin{align} \label{eq:gen}
    \bG_g(\bx) &= 
    \left(\frac{p_0 + \lambda}{p(\bx) + \lambda} \right)^2 
    \boldsymbol{I}_D, \\
    \bG_g^{-1}(\bx) &= 
    \left(\frac{p(\bx) + \lambda}{p_0 + \lambda}\right)^2 
    \boldsymbol{I}_D. 
\end{align}
As $\lambda \to \infty$, the metric reduces to the Euclidean metric. The determinant is  
$\det \bG_g(\bx) = \left(\tfrac{p_0 + \lambda}{p(\bx) + \lambda}\right)^{2D}$.
Figure~\ref{fig:fig1} illustrates the effect of $\lambda$ on the Hausdorff density along geodesics $t \mapsto p_{\mathcal{H}}\left(\hat{\gamma}_{(\bx, \bbv)}(t)\right)$, and Figure~\ref{fig:geodesics} again shows how the Generative metric transforms the space.
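
For this metric the quantities used by the sampler take a simple form. The following is a sketch derived from Eqs.~\eqref{eq:hausdensity} and \eqref{eq:gen} together with the definition of the Christoffel symbols, not a statement taken from \citet{Kim2024}; the shorthand $\phi$ and the Kronecker delta $\delta_{ij}$ are introduced here for illustration. Substituting the determinant into Eq.~\eqref{eq:hausdensity} gives
\begin{equation*}
p_{\cH}(\bx) = p(\bx) \left(\frac{p(\bx) + \lambda}{p_0 + \lambda}\right)^{D}.
\end{equation*}
Writing $\bG_g(\bx) = e^{2\phi(\bx)}\, \bI_D$ with $\phi(\bx) = \ln(p_0+\lambda) - \ln(p(\bx)+\lambda)$, the Christoffel symbols reduce to $\Gamma^k_{ij} = \delta_{ik}\,\partial_j \phi + \delta_{jk}\,\partial_i \phi - \delta_{ij}\,\partial_k \phi$, so the velocity part of Eq.~\eqref{eq:geoeqs} becomes
\begin{equation*}
\dot{\bbv} = -2\,(\nabla\phi^\top \bbv)\,\bbv + \|\bbv\|_2^2\, \nabla\phi,
\qquad
\nabla\phi(\bx) = -\frac{\nabla p(\bx)}{p(\bx) + \lambda},
\end{equation*}
which involves only first-order derivatives of $p$.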


\begin{figure}[t]
    \centering
\includegraphics[width=0.49\linewidth]{figs/figure1/geodesic_balls_monge.png}
\includegraphics[width=0.49\linewidth]{figs/figure1/geodesic_balls_generative.png}
\caption{    
    Exponential map for Riemannian balls of increasing radius on the Funnel distribution. Left: the Monge metric with $\alpha = 1$. Right: the Generative metric with $\lambda=0.1$ and $p_0=0.1$. Each color corresponds to a larger radius around the base point ($\star$). Both metrics achieve the desired goal of shortening the distances to points along the narrow funnel that would be difficult to reach in a Euclidean geometry.
    }
    \label{fig:geodesics}
\end{figure}



\subsection{For Bridging the Modes}
\label{sec:metrics_global}

The above metrics adapt to the local curvature and have been designed to improve sampling of, for instance, narrow funnels by re-defining the proximity (see Figure~\ref{fig:geodesics}). For assisting exploration of multimodal targets we need a different kind of metric, one that makes modes that are far apart in the original Euclidean sense appear closer.
% (see Figure~\ref{fig:geodesics_multimodal}). 
%
With the exception of the construction of \citet{Lan2014}, which we will discuss in Section~\ref{sec:related_work}, we are not aware of any previous metrics designed for this. Next, we introduce two computationally efficient metrics, with fast inverses and determinants, for assisting multimodal sampling. 

%We start by reminding that any matrix $\bG(\bx)$ defines a valid Riemannian metric as long as it is positive definite for every $\bx \in \man$ and varies continuously on $\man$. Since the inverse of a positive definite matrix is also positive definite, we observe that it is possible to use any of the previously formulated $\bG^{-1}(\bx)$ for defining a metric. 

Geodesic curves maintain a constant velocity norm in the Riemannian sense by construction.  Let $\bx_t=\gamma_{(\bx_0, \bbv_0)}(t)$ be a geodesic curve with velocity $\bbv_t=\dot\gamma_{(\bx_0, \bbv_0)}(t) $, starting from $\bx_0$ with initial velocity $\bbv_0$. If the geodesic moves toward a low-probability region where $p(\bx_t) \to 0$, then the ``mode bridging'' behavior occurs if the Euclidean velocity norms satisfy $\norm{\bbv_0}_2 \ll \norm{\bbv_t}_2$.
This means that as $t$ increases in low-density regions, the geodesic curves accelerate and locally pull the distant modes closer.  

We propose two metrics with the desired behavior by leveraging the metrics described in Section~\ref{sec:metrics_local} in a novel way. Any matrix $\bG(\bx)$ defines a valid Riemannian metric as long as it is positive definite for every $\bx \in \man$ and varies continuously on $\man$. Since the inverse of a positive definite matrix is also positive definite, any of the previously formulated $\bG^{-1}(\bx)$ can itself be used to define a metric. This gives two new metrics that help explore multiple modes in different ways:
%Next, we explain the concrete metrics and their characteristics.
%concretely, we propose:

\paragraph{The Inverse Monge Metric} We use
%
\begin{align*}
\bG_{IM}(\bx) &= 
    \bI_D 
    - \frac{\alpha^2}{1+\alpha^2 \|\nabla \ell\|^2} \nabla \ell \nabla \ell^\top,
\end{align*}
%
as the metric tensor, with the inverse $\bG_{IM}^{-1}(\bx) = \bG_M(\bx)$ given by the previously introduced metric tensor of the standard Monge metric
(Eq.~\eqref{eq:monge}). The determinant of this metric is $1/\det(\bG_M)$, and hence it retains the computational efficiency of the original Monge metric.  Figure~\ref{fig:geodesics_multimodal} illustrates the geodesics emanating from one mode of a bimodal distribution under the Inverse Monge metric. The metric twists the curves towards the second mode and slightly increases acceleration (seen by the change of color). Observation~\ref{obs:invmonge} mathematically states the conditions for the change of acceleration caused by the metric.

\begin{observation}
\emph{
    Let $p(\bx)$ be a smooth density function. Let $(\bx_t, \bbv_t)$ be the geodesic flow with initial conditions $(\bx_0,\bbv_0)$  with respect to the Inverse Monge metric, such that $\bx_0$ is a local maximum.
    Then $\norm{\bbv_t}_{2}\geq \norm{\bbv_0}_{2}$ 
    for all $t\neq 0$.
    } \label{obs:invmonge}
\end{observation}
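The inequality can be read off from the spectrum of the metric. The following is a brief sketch, assuming that $\ell$ is differentiable at $\bx_0$ so that $\nabla \ell(\bx_0) = \boldsymbol{0}$ at the local maximum; the full argument is deferred to the appendix. The rank-one term only shrinks the component of a velocity $\bbv$ along $\nabla \ell$, so
\begin{align*}
\bbv^\top \bG_{IM}(\bx)\, \bbv
  &= \norm{\bbv}_2^2 - \frac{\alpha^2 (\nabla \ell^\top \bbv)^2}{1+\alpha^2 \|\nabla \ell\|^2}
  \leq \norm{\bbv}_2^2 .
\end{align*}
Since $\bG_{IM}(\bx_0) = \bI_D$ and the Riemannian velocity norm is constant along the geodesic,
\begin{align*}
\norm{\bbv_t}_2^2
  &\geq \bbv_t^\top \bG_{IM}(\bx_t)\, \bbv_t
  = \bbv_0^\top \bG_{IM}(\bx_0)\, \bbv_0
  = \norm{\bbv_0}_2^2 ,
\end{align*}
with equality only when $\bbv_t$ is orthogonal to $\nabla \ell(\bx_t)$ or the gradient vanishes.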


\paragraph{The Inverse Generative Metric} We use
%
\begin{align*}
\bG_{Ig}(\bx) &= 
    \left(\frac{p(\bx) + \lambda} {p_0 + \lambda}\right)^2 
    \boldsymbol{I}_D,
\end{align*}
%
and obtain the inverse $\bG_{Ig}^{-1}(\bx) = \bG_g(\bx)$ as the metric tensor of the standard Generative metric (Eq.~\eqref{eq:gen}) and the determinant as $1/\det(\bG_g)$. Again, the computational efficiency of the original Generative metric is retained. Figure~\ref{fig:geodesics_multimodal} illustrates the main effect of the metric, which is to accelerate in low-density regions (indicated by the light color); it also twists the trajectories slightly (best seen within the initial mode and beyond the second mode in the top right corner). 
Additionally, Figure~\ref{fig:fig1} illustrates the behavior for a univariate distribution. 
The acceleration behavior is mathematically stated in Observation~\ref{obs:invgen}.


% Next, we explain how the Inverse Monge and Inverse Generative metrics achieve the desired mode-bridging ability. 

% Figure~\ref{fig:fig1} illustrates this effect for the Inverse Generative metric. The following observations collect the previous idea which we discuss further in Appendix~\ref{app:proof_props}:

\begin{observation}
\emph{
    Let $p(\bx)$ be a smooth density function. Let $(\bx_t, \bbv_t)$ be the geodesic flow with respect to the Inverse Generative metric, with initial conditions $(\bx_0,\bbv_0)$ such that $p(\bx_0)>0$. Then, for $t$ such that $p(\bx_t) \to 0$, we have $\norm{\bbv_t}_{2}>\norm{\bbv_0}_{2}$.
    } \label{obs:invgen}
\end{observation}

The mathematical details for Observations~\ref{obs:invmonge} and \ref{obs:invgen} are given in Appendix~\ref{app:proof_props}.
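For the Inverse Generative metric the relation between the Euclidean speeds is explicit. As a short sketch, assuming $\lambda > 0$: the metric is a scalar multiple of the identity, so conservation of the Riemannian velocity norm along the geodesic reads
\begin{align*}
\frac{p(\bx_t)+\lambda}{p_0+\lambda}\,\norm{\bbv_t}_2
  &= \frac{p(\bx_0)+\lambda}{p_0+\lambda}\,\norm{\bbv_0}_2, \\
\norm{\bbv_t}_2
  &= \frac{p(\bx_0)+\lambda}{p(\bx_t)+\lambda}\,\norm{\bbv_0}_2 .
\end{align*}
As $p(\bx_t) \to 0$ the ratio tends to $1 + p(\bx_0)/\lambda > 1$, so a smaller $\lambda$ yields a stronger speed-up in low-density regions.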

\begin{figure}[t]
    \centering
% \includegraphics[width=0.32\linewidth]{figs/rebutal/geodesic_euclidean.png}
\includegraphics[width=0.32\linewidth]{figs/rebutal/geodesic_inverse_monge.png}
\hspace{0.8cm}
\includegraphics[width=0.32\linewidth]{figs/rebutal/geodesic_inverse_generative.png}
\caption{
    Effect of the metric for multimodal targets, showing the geodesics (lines) and the relative compression of the distance (color; yellow means faster travel in that area, darker colors mean slower travel). %{\bf Left:} Euclidean metric. 
     {\bf Left:} The Inverse Monge metric ($\alpha = 0.001$) steers the geodesics more strongly toward the other mode, and also slightly compresses the distances in the low-probability region.  
    {\bf Right:} The Inverse Generative metric ($\lambda = 1$) compresses the distances in the low-probability region, but twists the paths only slightly.  
    }
    \label{fig:geodesics_multimodal}
\end{figure}

\section{Experiments}\label{sec:exp}

We evaluate \ourmethod\ for targets with sharp curvature (Section~\ref{sec:exp1}), multiple modes (Section~\ref{sec:exp2}), or both (Section~\ref{sec:exp3}), always considering different choices of the metric. We also empirically quantify the effect of the numerical integrator. Code reproducing the experiments is available at
%\footnote{ Code available at:
\href{https://github.com/williwilliams3/magss}{github.com/williwilliams3/magss}.  
%}.

\paragraph{Evaluation}
We primarily use targets with known reference samples, which allows measuring accuracy as the 1-Wasserstein (earth mover's) distance between the reference samples and the samples produced by the algorithm \citep{Flamary2021}. 
Besides accuracy, we characterize the samplers by the probability of jumping between modes, estimated as the raw fraction of consecutive samples that lie in different modes (defined manually for each problem).
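One way to write this estimate, using notation introduced here only for illustration: let $m(\bx) \in \{1,\ldots,K\}$ denote the manually defined mode assignment and let $\bx^{(1)},\ldots,\bx^{(N)}$ be the samples of a single chain. The jump percentage then corresponds to
\begin{align*}
\text{jump\%} = \frac{100}{N-1} \sum_{n=1}^{N-1} \mathbf{1}\big\{ m(\bx^{(n+1)}) \neq m(\bx^{(n)}) \big\},
\end{align*}
where $\mathbf{1}\{\cdot\}$ is the indicator function. How the estimate is pooled over several chains is left open in this sketch.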


\paragraph{Comparison methods}
To showcase the effect of the metric, we also run \ourmethod\ with the Euclidean metric, $\bG(\bx) = \boldsymbol{I}_D$, and we additionally compare against the No-U-Turn Sampler, parallel tempering, and diffusive Gibbs sampling.

The No-U-Turn Sampler (NUTS) is an auxiliary-variable sampler that augments the position $\bx_t$ with a velocity $\bbv_t$ which jointly follow the Hamiltonian dynamics \citep{Neal2011}. It adaptively determines the integration time by stopping at the first U-turn, i.e., the first time $t>0$ such that $\langle \bx_t - \bx_0, \bbv_t \rangle < 0$ \citep{Hoffman2014}.

Parallel tempering (PT) runs many Markov chains in parallel, each targeting $p(\bx)^{1/\tau_i}$ for a different temperature $\tau_i \geq 1$, with $\tau_1 = 1$ recovering the original target. As $\tau \to \infty$, the density flattens, facilitating transitions between high-density regions that are far apart. The parallel chains randomly exchange states according to a Metropolis--Hastings ratio, and thus visit the modes more often \citep{Swendsen1986, Geyer1991}. 
Our implementation of PT follows \citet{Latuszynski2025}.
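For reference, a common form of the exchange move is given below as a sketch; it is not meant as a description of the exact swap schedule of \citet{Latuszynski2025}. A proposed exchange of the states $\bx^{(i)}$ and $\bx^{(j)}$ of two chains with temperatures $\tau_i$ and $\tau_j$ is accepted with probability
\begin{align*}
\min\left\{1,\;
\left(\frac{p(\bx^{(j)})}{p(\bx^{(i)})}\right)^{1/\tau_i - 1/\tau_j}
\right\},
\end{align*}
which follows from the Metropolis--Hastings ratio for the product target $\prod_k p(\bx^{(k)})^{1/\tau_k}$ with a symmetric swap proposal.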

Diffusive Gibbs sampling (DiGS) by \citet{Chen2024} is a sampler designed for addressing multimodality.
It approaches the sampling task by using an auxiliary variable $\tilde{\bx}$ within a Gibbs scheme. It uses the variance-preserving (VP) noise scaling \citep{Song2021}, $p(\tilde{\bx}|\bx) = \cN(\tilde{\bx}|\alpha_t \bx, \sigma^2_t)$ with $\sigma_{t}=\sqrt{1-\alpha_{t}^{2}}$, which is sampled directly, whereas $p(\bx|\tilde{\bx})\propto p(\tilde{\bx}|\bx) p(\bx)$ is sampled through a local MCMC sampler. It has an additional Metropolis-within-Gibbs proposal scheme $q(\bx|\tilde{\bx}) = \cN(\bx| \tilde{\bx}/\alpha_t, (\sigma_t/\alpha_t)^2)$. VP has the property that when $\alpha_t\to 0$ then $p(\tilde{\bx}|\bx)= \cN(\tilde{\bx}|0, \bI_D)$, and when $\alpha_t \to 1$ then, informally, $p(\tilde{\bx}|\bx)= \delta_{\bx}$.


\subsection{Complex unimodal targets}
\label{sec:exp1}

We evaluate the methods on three canonical benchmark targets (funnel, hybrid Rosenbrock and squiggle) which exhibit strong curvature. The densities are given in Appendix~\ref{app:toydist}. Since these targets are unimodal, we only consider the metrics presented in Section~\ref{sec:metrics_local} and exclude PT. 

Figure~\ref{fig:toydistdims} shows that \ourmethod\ with Fisher metric $\bG_F(\bx)$ is clearly superior, but runs out of the limited computational budget already at low dimensions, and the Monge metric $\bG_M(\bx)$ offers notable improvement for Rosenbrock and squiggle targets. DiGS remains on the level of the Euclidean \ourmethod\ and the Generative metric does not help either.


\begin{figure*}[t]
    \centering
    \includegraphics[width = 0.3\textwidth]{figs/rebutal/funnel_dims.png}    
    \includegraphics[width = 0.3\textwidth]{figs/rebutal/rosenbrock_dims.png}    
    \includegraphics[width = 0.3\textwidth]{figs/rebutal/squiggle_dims.png}    
    \caption{Univariate sampling accuracy in various metrics (Wasserstein distance, lower is better) for targets of varying dimensionality. The medians over 5 runs are connected with a line.
    Left: Funnel. Middle: Rosenbrock. Right: Squiggle.    
    }
    \label{fig:toydistdims}
\end{figure*}


\paragraph{Experiment specification:}
We obtain $10{,}000$ samples using $10$ chains and omit results for runs that did not complete in 12 hours.
We set $\alpha^2=1$ for the Monge metric since this value has been shown to work \citep{Hartmann2022}.
We select $\lambda=1$, $p_0=1$ for the Generative metric without further tuning.
We use the Dopri5 integrator with adaptive step size. We set $w=3$ and $m=8$. DiGS and NUTS use a single noise scale $\alpha = 1$ and step size $0.1$ for MALA within the algorithm.


\subsection{Multimodal with simple modes}
\label{sec:exp2}

For studying mode exploration, we use a target of two
$D$-dimensional Gaussian distributions centered at $-\boldsymbol{1}_{D}$ and $\boldsymbol{1}_{D}$ with scale $\sigma=0.1$ and weights $\{0.2, 0.8\}$.
The distance between the modes ($2\sqrt{D}$) grows with the dimension, making transitions between the modes more difficult. Here we only consider the new metrics for boosting mixing between the modes (Section~\ref{sec:metrics_global}).
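To make the scaling concrete, and taking the scale $\sigma$ to be the standard deviation of each component, the separation is
\begin{align*}
\norm{\boldsymbol{1}_{D} - (-\boldsymbol{1}_{D})}_2 = \norm{2\,\boldsymbol{1}_{D}}_2 = 2\sqrt{D},
\end{align*}
so that, for example, at $D=16$ the modes are $8$ units apart, which amounts to $80$ standard deviations for $\sigma = 0.1$.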

Figure~\ref{fig:twogauss} reports the corresponding accuracies and the percentage of jumps between the modes. While the comparison methods PT and DiGS explore the modes well in low dimensions, they get completely stuck in one mode for $D \geq 16$. \ourmethod\ and \metaourmethod\ with the Inverse Monge metric ($\alpha=0.1$) are able to jump between the modes even in higher dimensions, and the \emph{meta-sampler} is overall the most accurate method. 


% {
% \renewcommand{\arraystretch}{0.9} % Reduce row height
% \setlength{\tabcolsep}{3pt} % Reduce column spacing
% \noindent
% \small
% \begin{table}[t]
%     \caption{Mixture of two Gaussians ($\bG_{IM}$, with $\alpha=0.1$).}
%     \centering
%     \begin{tabular}{@{}l *{7}{r} @{}}
%     \toprule
%     jump\% & \multicolumn{6}{c}{dimension} \\
%     \cmidrule(lr){2-7}
%           sampler  & 2  & 4  & 8  & 16  & 32  & 64 \\
%     \midrule
%     \ourmethod    & 8.96  & 5.07  & 2.28  & 0.8   & 0.2   & 0.03  \\
% \metaourmethod & 18.91  & 12.33  & 7.5  & 4.29  & 2.45  & 1.07  \\
%     DiGS    & 5.05  & 0.02  & 0.0   & 0.0   & 0.0   & 0.0   \\
%     PT      & 12.54 & 4.68  & 0.23  & 0.0   & 0.0   & 0.0   \\
%     \bottomrule
%     \end{tabular}
% \label{tab:twogauss}
% \end{table}
% }

\begin{figure}[t]
    \centering    
    \includegraphics[width=0.49\linewidth]{figs/rebutal/twogaussians_distances_dims.png}
    \includegraphics[width=0.49\linewidth]{figs/rebutal/twogaussians_distances_crossings_dims.png}
    \caption{Accuracy (lower is better) for mixture of Gaussians. Both \ourmethod\ variants use $\bG_{IM}$ with $\alpha=0.1$.}
    \label{fig:twogauss}
\end{figure}


\paragraph{Experiment specifications:} For MALA we tune the step size in each dimension to reach a $60\%$ acceptance rate, since this is close to the optimal rate for Gaussians \citep{Roberts1998}. For \ourmethod\ we
try the Euclidean metric and the grid $\alpha^2\in \{10^{-3}, 10^{-2}, 10^{-1}, 1, 10\}$ and $\lambda \in \{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1, 10\}$. We find that $\alpha^2=0.1$ is always optimal. For \metaourmethod\ we fix $\alpha^2=0.1$ based on what was observed for \ourmethod.
We use the Dopri5 solver with adaptive step size. DiGS uses $10$ MALA steps per iteration and $T=100$ equally spaced times between $\alpha_1=10^{-4}$, $\alpha_{1000} = 1-10^{-4}$. PT uses $N=100$ temperatures on the scale $\tau_i = b_{\min}^{-{i}/{N}}$ for $i=1,\ldots,N$, where $b_{\min}=10^{-4}$.


\subsection{Multimodal with complex modes}
\label{sec:exp3}


To demonstrate that we can simultaneously handle multimodality and complex local geometry, we consider a (uniform) mixture of two narrow bivariate distributions, the Rosenbrock and Squiggle distributions (Figure~\ref{fig:narrow} left; the red line is purely for identifying jumps between the modes, Table~\ref{tab:narrow_jumps}).
We use the Inverse Monge and Inverse Generative metrics.
Although PT attains the highest jump percentage (Table~\ref{tab:narrow_jumps}), Figure~\ref{fig:narrow} (right) indicates that it is the least accurate method, requiring substantially more samples to match the target well. All methods reach approximately the same Wasserstein distance if run long enough, but both of our variants achieve it with fewer samples, confirming more efficient mixing.



\begin{table}[t]
    \centering
    \caption{Mixture of narrow distributions.}
    \begin{tabular}{llll}
\toprule
sampler & metric & jump\% & t(s) \\
\midrule
PT & NA & $6.18$ & 2 \\
DiGS & NA & $0.27$ & 2 \\
\ourmethod & $\bG_{Ig}$, $\lambda=1.0$ & $0.81$ & 178 \\
\metaourmethod & $\bG_{IM}$, $\alpha^2=10^{-4}$ & $2.62$ & 1327 \\
\bottomrule
\end{tabular}
    
    \label{tab:narrow_jumps}
\end{table}

\paragraph{Experiment specifications:}
We obtain $10{,}000$ samples using $10$ chains.
We run DiGS with 5 noise steps between $0.1$ and $0.9$, and $10$ MALA iterations per sample. PT uses $\tau \in \{1.0, 5.62, 31.62, 177.83, 1000\}$ and thinning of $10$. \ourmethod\ and \metaourmethod\ are tuned using the same grid of values as in Experiment~\ref{sec:exp2}, reporting the best based on distances. \metaourmethod\ uses $5$ sweeps and $10$ MALA iterations per sample. PT, DiGS and \metaourmethod\ rely on MALA with step size $0.001$ ($\approx60\%$ acceptance rate). We use $w = 3$ and $m=8$ and the adaptive Dopri8 integrator. %with adaptive step-size. 
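The PT temperatures listed above are consistent with a geometric ladder between $1$ and $1000$. As a quick check, with an index $k$ introduced here only for illustration,
\begin{align*}
\tau_k &= 1000^{k/4} = 10^{3k/4}, \quad k = 0,\ldots,4, \\
(\tau_0,\ldots,\tau_4) &\approx (1,\ 5.62,\ 31.62,\ 177.83,\ 1000).
\end{align*}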


\begin{figure}[t]
    \centering    
    \begin{minipage}{0.5\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figs/narrow/generalmixture.png}
    \end{minipage}
    \begin{minipage}{0.49\linewidth}
        \centering
        \includegraphics[width=\linewidth]{figs/narrow/genmix_time.png}
    \end{minipage}
    \caption{Left: Mixture of narrow distributions, with samples using \metaourmethod.
    Right: Wasserstein distance as a function of iterations (samples). For \ourmethod\ we use $\bG_{Ig}$ with $\lambda=1$, and for \metaourmethod\ we use $\bG_{IM}$ with $\alpha^2=10^{-4}$.}
    \label{fig:narrow}
\end{figure}

\subsection{Field System}

We include a highly multimodal target distribution ($2^D$ modes for $D=16$) by replicating the Allen--Cahn Field System model experiment from \citet{Cabezas2024}. The distribution has two global maxima at $(1,\ldots,1)$ and $(-1,\ldots,-1)$, and several lower-density modes at points of the form $x_i = \pm 1$ for all $i$. DiGS collapses to a single mode and PT explores only the two most dominant modes, while our \metaourmethod\ also explores the additional modes (Figure~\ref{fig:fieldsystem}). 

% Since reference samples are not directly obtainable, following \citet{Cabezas2024} we report in Table~\ref{tab:phifour} the  Kernel Stein Discrepancy (KSD V-stat) \citep{Liu2016}. The target density is symmetric along each axis, we initialize the sampling at $(-1,..,-1)$. We choose the dimension $x_8$ as a representative of how well the samplers capture the symmetry. We also report the percentage of samples such that $x_8>0$ the true value is $50\%$. 

The target density is symmetric along each axis. The initial sampling position is $(-1,\ldots,-1)$ and we use the marginal distribution of $x_8$ to evaluate how well each sampler captures the symmetry. In particular, we report the percentage of samples with $x_8 > 0$, which should be $50\%$ under the true distribution. 
Since reference samples are not directly available, we follow \citet{Cabezas2024} and also report the Kernel Stein Discrepancy (KSD V-stat) \citep{Liu2016} in Table~\ref{tab:phifour}.

Our method explores more modes than the competing methods (PT and DiGS),
% This is likely because the sampler's ability to jump between modes results in some samples being located in regions with higher gradient norms, which increases the KSD, since it is depends on gradient of the log target density. It can penalize such transitions even if the overall the mode coverage is better.
although the KSD V-stat is worse. Note, however, that the KSD V-stat does not account for multimodality at all; DiGS has a better value despite covering only a single mode and failing completely in terms of the marginal distribution metric.
% We also note that the KSD V-stat does not account for multimodality: for example, DiGS achieves a low KSD value despite its samples covering only a single mode. 
In contrast, PT and \metaourmethod\ exhibit similar percentages of samples with $x_8 > 0$.
We provide the density of the model, an explanation of the multimodality of the model, and the computation of KSD V-stat in Appendix~\ref{app:toydist}.
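To see why the statistic is largely insensitive to mode coverage, recall the generic form of the V-statistic, sketched here with an unspecified positive definite kernel $k$; the kernel actually used is described in the appendix. With the score $s(\bx) = \nabla \log p(\bx)$ and samples $\bx^{(1)},\ldots,\bx^{(N)}$,
\begin{align*}
\widehat{\mathrm{KSD}}^2 &= \frac{1}{N^2}\sum_{i,j=1}^{N} u_p(\bx^{(i)}, \bx^{(j)}), \\
u_p(\bx,\bx') &= s(\bx)^\top s(\bx')\, k(\bx,\bx') + s(\bx)^\top \nabla_{\bx'} k(\bx,\bx') \\
&\quad + s(\bx')^\top \nabla_{\bx} k(\bx,\bx') + \operatorname{tr}\!\big(\nabla_{\bx}\nabla_{\bx'} k(\bx,\bx')\big).
\end{align*}
The target enters only through the score, which is a local quantity and is essentially unaffected by the relative weights of well-separated modes.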

\begin{table}[t]
    \centering
    \caption{Field System model}
\begin{tabular}{llll}
\toprule
sampler &  KSD V-stat & $x_8>0$ & t(s) \\
\midrule
PT &   $0.13\pm0.04$ & $0.35$ & 6 \\
DiGS &   $0.13\pm0.05$ & $0.0$ & 157 \\
\metaourmethod  & $2.98\pm0.7$ & $0.33$ & 32 \\
\bottomrule
\end{tabular}    
    \label{tab:phifour}
\end{table}


\paragraph{Experiment specifications:}
We obtain $10{,}000$ samples using $10$ chains initialized at $(-1,\ldots,-1)$.
DiGS uses $T=1000$ equally spaced times between $\alpha_1=10^{-5}$ and $\alpha_{1000} = 1-10^{-5}$. PT uses $N=200$ temperatures on the scale $\tau_i = b_{\min}^{-{i}/{N}}$ for $i=1,\ldots,N$, where $b_{\min}=10^{-5}$. For \metaourmethod\ we try values of $\alpha$ and $\lambda$ in powers of ten, finding that $\lambda=10^{-6}$ with a single sweep maximizes the number of jumps between modes. We use $w = 3$ and $m=8$ and the Dopri5 integrator with adaptive step size. For all methods, MALA uses $10$ iterations per sample and step size $0.005$ (roughly 60\% acceptance rate). 

\begin{figure}[t]
    \centering
    \includegraphics[width=0.32\linewidth]{figs/rebutal/field_digs.png}
    \includegraphics[width=0.32\linewidth]{figs/rebutal/field_pt.png}
    \includegraphics[width=0.32\linewidth]{figs/rebutal/field_meta.png}
    \caption{Samples from the Allen--Cahn Field System model \citep{Cabezas2024} with $2^{16}$ modes that zig-zag between $-1$ and $1$ on the y-axis over the $D=16$ values on the x-axis. Euclidean DiGS (left) gets stuck in one mode, Parallel Tempering (middle) only explores the two dominant modes with constant value over the x-axis, whereas \metaourmethod\ (right; $\boldsymbol{G}_{Ig}$ with $\lambda=10^{-6}$) also explores the modes that switch between the extremes.}
    \label{fig:fieldsystem}
\end{figure}


\subsection{Effect of numerical integrator} \label{sec:gmm}

We use numerical integrators for computing the geodesics in Eq.~\eqref{eq:geoeqs}. To explore the effect of the integrator, we present results for a broad range of integrators on a multimodal benchmark task considered previously by \citet{Chen2024}. The target is a 40-mode Gaussian mixture model with equal weights, each component having variance $\sigma=0.1$, where the means are distributed uniformly on the square $(-40, 40)^2$.

We use seven different integrators for the $\bG_{IM}$ and $\bG_{Ig}$ metrics, including both adaptive and fixed step sizes as implemented by \citet{Kidger2021}, and report the results in Figure~\ref{fig:gmm40}. The three main conclusions are: (a) For metrics that are further from Euclidean (large $\alpha$ or small $\lambda$), the integration time for adaptive methods grows dramatically. This is a natural consequence of operating in a less flat geometry.
(b) Good accuracy typically requires such a geometry, which means there is an inherent trade-off between accuracy and computation.
(c) Simple fixed-step integrators, even the Euler method, are efficient when they work, but for robustness we recommend adaptive methods. We consider Dopri5 a good practical default, but Euler is worth trying for the Inverse Generative metric.
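To illustrate why a simple scheme can be attractive for the Inverse Generative metric, the following sketch writes out the geodesic equations of a conformal metric $\bG(\bx) = e^{2\varphi(\bx)} \bI_D$, obtained from the standard Christoffel symbols of such a metric. The symbol $\varphi$ and the step size $h$ are notation introduced here, and the form may differ from the one used in Eq.~\eqref{eq:geoeqs}. For $\bG_{Ig}$ we have $\varphi(\bx) = \log(p(\bx)+\lambda) - \log(p_0+\lambda)$, so that
\begin{align*}
\dot{\bx}_t &= \bbv_t, \\
\dot{\bbv}_t &= \norm{\bbv_t}_2^2\, \nabla\varphi(\bx_t) - 2\,\big(\nabla\varphi(\bx_t)^\top \bbv_t\big)\, \bbv_t, \\
\nabla\varphi(\bx) &= \frac{\nabla p(\bx)}{p(\bx)+\lambda}.
\end{align*}
One explicit Euler step with step size $h$ is then
\begin{align*}
\bx_{n+1} &= \bx_n + h\,\bbv_n, \\
\bbv_{n+1} &= \bbv_n + h\left( \norm{\bbv_n}_2^2\, \nabla\varphi(\bx_n) - 2\,\big(\nabla\varphi(\bx_n)^\top \bbv_n\big)\, \bbv_n \right),
\end{align*}
which requires only one evaluation of the density and its gradient per step.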

 
\begin{figure}[t]
    \centering
    \includegraphics[width=0.49\linewidth]{figs/gmm40/invGen_distances_mean_integrators.png}
    \includegraphics[width=0.49\linewidth]{figs/gmm40/invMon_distances_mean_integrators.png}
    \includegraphics[width=0.49\linewidth]{figs/gmm40/invGen_time_integrators.png}
    \includegraphics[width=0.49\linewidth]{figs/gmm40/invMon_time_integrators.png}
    \caption{Different numerical integrators for the 40-mode Gaussian mixture model. 
    The black dotted lines are PT and DiGS.
    Left: Inverse Generative metric with parameter $\lambda$. Right: Inverse Monge metric with parameter $\alpha^2$.     
    }
    \label{fig:gmm40}
\end{figure}

\paragraph{Experiment specifications:}
We obtain $10{,}000$ samples using $10$ chains. 
PT has $\tau \in \{1, 5.62, 31.62, 177.83, 1000\}$ and thinning of $200$. DiGS uses $\alpha=0.1$, thinning of $200$ and $5$ MALA steps per iteration. MALA has a step size of $0.1$. 
 \ourmethod\ is run with $w=3$, $m=8$ for the $\bG_{IM}$ and $\bG_{Ig}$ metrics with the parameter grid of Experiment~\ref{sec:exp2}. We test seven different numerical integrators for Equation~\eqref{eq:geoeqs}. The fixed integration step size is $0.01$. Details are given in Appendix~\ref{app:exp}.






\section{Related work}
\label{sec:related_work}

\citet{Lan2014} constructed the Wormhole Hamiltonian Monte Carlo, where a specific geometry is built to (only) connect the modes of multimodal distributions. The modes are first identified during the evolution of the Markov chain. After a new mode is identified, its location is ``stored'' for later use. A jump using the updated mode candidates guarantees correct detailed-balance equations. While this work served as an inspiration for us, it requires notable additional components. In contrast, \ourmethod\ does not require separate identification or storage of the modes, but instead shrinks the distances by naturally warping the space. 


\section{Conclusions}

Our aim was to show that local curvature and multimodality can be addressed by the same set of tools, namely Riemannian geometry. We provided a concrete Riemannian slice sampler, introduced two new metrics for improving mixing between modes, and showed that we can achieve accuracy and mixing comparable to recent samplers designed specifically for multimodal targets, by only using  Riemannian metrics for this task.

One obvious limitation is the computational cost, caused by numerical integration of the geodesics. Even when using metrics with fast inverses, the per-iteration cost of \ourmethod\ is larger than that of competing methods. However, we note that we used maximally exact solvers rather than seeking the highest computational efficiency. Now that the principle has been demonstrated, the use of more approximate numerical integrators for speeding up the overall computation could be studied in future work.
% 

% \begin{contributions} % will be removed in pdf for initial submission 
% 					  % (without ‘accepted’ option in \documentclass)
%                       % so you can already fill it to test with the
%                       % ‘accepted’ class option
%     Briefly list author contributions. 
%     This is a nice way of making clear who did what and to give proper credit.
%     This section is optional.

%     BW conceived the idea, developed the method, wrote the code and wrote the paper.
%     HY helped the discussion, wrote PT code and collaborated in the DiGS code.
%     MH helped with the writing, improves proposition 1.
%     HL helped with the theoretical understanding and writing. 
%     GA participated in the discussion. 
%     AK helped with the writing and discussions. 
% \end{contributions}

\begin{acknowledgements} % will be removed in pdf for initial submission,
						 % (without ‘accepted’ option in \documentclass)
                         % so you can already fill it to test with the
                         % ‘accepted’ class option
BW, HY, HPHL and AK were supported by the Research Council of Finland Flagship programme: Finnish Center for Artificial Intelligence (FCAI), and additionally by grants: 363317, 345811 and 348952. GA was supported by the DFF Sapere Aude Starting Grant ``GADL''. The authors wish to acknowledge CSC – IT Center for Science, Finland, for computational resources.
\end{acknowledgements}

% References
\bibliography{bibtex.bib}

\appendix

\include{appendix}


\end{document}
