Title: Mean-Field Langevin Dynamics for Signed Measures via a Bilevel Approach

Abstract: Mean-field Langevin dynamics (MFLD) is a class of interacting particle methods that tackle convex optimization over probability measures on a manifold; these methods are scalable, versatile, and enjoy computational guarantees. However, some important problems (such as risk minimization for infinite-width two-layer neural networks, or sparse deconvolution) are originally defined over the set of signed, rather than probability, measures. In this paper, we investigate how to extend the MFLD framework to convex optimization problems over signed measures. Among two known reductions from signed to probability measures (the lifting and the bilevel approaches), we show that the bilevel reduction leads to stronger guarantees and faster rates (at the price of a higher per-iteration complexity). In particular, we investigate the convergence rate of MFLD applied to the bilevel reduction in the low-noise regime and obtain two results. First, this dynamics is amenable to an annealing schedule, adapted from [SWON23], that results in improved convergence rates to a fixed multiplicative accuracy. Second, we investigate the problem of learning a single neuron with the bilevel approach and obtain local exponential convergence rates that depend polynomially on the dimension and noise level (to be compared with the exponential dependence that would result from prior analyses).

* Equal contributions, authors ordered randomly.

Footnote 1: The square exponent on $\|\cdot\|_{TV}$ might appear unusual, but it is convenient for our subsequent developments. We show in App. A that the regularization path is the same with or without the square.

Section: Introduction
Let $\mathcal{M}(\mathcal{W})$ be the set of finite signed measures on a compact Riemannian manifold without boundary $\mathcal{W}$ and let $G : \mathcal{M}(\mathcal{W}) \to \mathbb{R}$ be a convex function, assumed smooth in the sense of Assumption 1 below. In this paper, we investigate optimization methods to solve
$$\min_{\nu \in \mathcal{M}(\mathcal{W})} G_\lambda(\nu), \qquad G_\lambda(\nu) := G(\nu) + \frac{\lambda}{2} \|\nu\|_{TV}^2, \tag{1.1}$$
where $\|\cdot\|_{TV}$ is the total variation norm and $\lambda > 0$ the regularization level (see footnote 1). This covers for instance risk minimization for infinite-width 2-layer neural networks (2NN) [BRVDM05; Bac17] by taking $\mathcal{W} = \mathbb{S}^d$ the unit sphere in $\mathbb{R}^{d+1}$ or $\mathcal{W} = \mathbb{R}^{d+1}$ and
$$G(\nu) = \mathbb{E}_{(x,y)\sim\rho}\, \ell(h(\nu, x), y) \quad \text{where} \quad h(\nu, x) = \int_{\mathcal{W}} \varphi(\langle x, w\rangle)\, d\nu(w). \tag{1.2}$$
Here $\varphi : \mathbb{R} \to \mathbb{R}$ is the activation function, $h(\nu, \cdot)$ is the predictor parameterized by $\nu$, $G$ is the (population or empirical) risk under the data distribution $\rho \in \mathcal{P}(\mathbb{R}^{d+1} \times \mathbb{R})$, and $\ell$ is smooth (uniformly in $y$) and convex in its first argument. These 2NNs will be our guiding examples throughout, but note that the class of problems covered by Eq. (1.1) is more general and includes for instance sparse deconvolution via the Beurling-LASSO estimator [DG12] or optimal design [MZ04].
To tackle such problems, interacting particle methods use the parameterization $\nu = \sum_{i=1}^m r_i \delta_{w_i}$ and apply gradient methods in a well-chosen geometry [Chi22c; YWR23; GCM23]. They have recently gained traction thanks to their scalability and flexibility, and in the context of 2NNs, the usual gradient descent algorithm is an instance of such a method. On the downside, global convergence guarantees remain difficult to obtain due to the nonconvex nature of the reparameterized problem, and existing positive results require either very specific settings [LMZ20] or modifications of the dynamics which often limit their scalability.
In a related, but slightly different context, mean-field Langevin dynamics (MFLD) solve entropy-regularized problems of the form
$$\min_{\mu \in \mathcal{P}(\mathcal{W}')} F_\beta(\mu), \qquad F_\beta(\mu) := F(\mu) + \beta^{-1} H(\mu), \tag{1.3}$$
where $\mathcal{P}(\mathcal{W}')$ is the space of probability measures on a manifold $\mathcal{W}'$ (typically $\mathbb{R}^d$), $F : \mathcal{P}(\mathcal{W}') \to \mathbb{R}$ is a (sufficiently regular) convex functional, $H(\mu) = \int \log(d\mu/d\mathrm{vol})\, d\mu$ is the negative differential entropy and $\beta > 0$. These dynamics are obtained as the mean-field limit of noisy interacting particle dynamics [MMN18; HRŠS21] and converge globally at an exponential rate [NWS22; Chi22b], under two key conditions on $F$: (i) a notion of regularity, which we refer to as displacement smoothness (see P1 below) and (ii) a uniform log-Sobolev inequality (LSI) condition (see P2 below). These mean-field, continuous-time guarantees have been further refined into computational guarantees for fully discrete algorithms [CRW22; SWN23]. The favorable properties of MFLD naturally lead to the following question:
Can we efficiently solve problems of the form Eq. (1.1) using MFLD?
At first, it is not obvious that MFLD can be applied at all since it is originally defined only for problems over probability measures. However, we can find in the literature two general recipes to reduce a problem over $\mathcal{M}(\mathcal{W})$ to a problem over $\mathcal{P}(\mathcal{W}')$, thus amenable to MFLD. The first one is a lifting reduction, which takes $\mathcal{W}' = \mathbb{R} \times \mathcal{W}$, where the extra dimension serves to encode the signed mass of particles [CB18, Section A.2] [Chi22c]. The second one, which takes $\mathcal{W}' = \mathcal{W}$, is a bilevel reduction [Bac21; TS24] that uses a variational representation of the regularizer $\|\cdot\|_{TV}^2$, common in the multiple kernel learning literature [LCBGJ04]. A first task is thus to compare the behavior of MFLD on these two approaches. Furthermore, MFLD involves an entropic regularization which is absent from Eq. (1.1). A second task is thus to analyze the behavior of MFLD in the large $\beta$ regime, when the regularization vanishes.
In this work, we tackle these two tasks and make the following contributions:
• In Sec. 3, we introduce the lifting and bilevel reductions and compare the "displacement smoothness" (P1) and "uniform LSI" (P2) properties of the resulting problems. These properties play a central role in the global convergence analysis of MFLD. Specifically, we consider a large class of lifting reductions and show that none satisfies simultaneously (P1) and (P2) unless $\lambda$ is large. In contrast, the bilevel reduction satisfies both under mild assumptions. So in the sequel we focus on MFLD applied to the bilevel reduction.
• In Sec. 4, we investigate what convergence rates can be obtained for the problem (1.1) by using MFLD on the bilevel formulation. While a classical simulated annealing technique yields convergence in $O(\log\log t / \log t)$, we show that the structure of the bilevel objective is in fact amenable to a more efficient annealing schedule, adapted from [SWON23], that reaches a fixed multiplicative accuracy, say $1.01 \inf G_\lambda$, in time $e^{O(\lambda^{-1} \log \lambda^{-1})}$ instead of $e^{O(\lambda^{-2})}$ for the classical schedule.
• In Sec. 5, to obtain a more complete picture, we investigate the problem of learning a single neuron. Here, using a Lyapunov type argument, we show that the local convergence rate of MFLD applied to the bilevel formulation scales polynomially in β and d, at odds with all previous MFLD analyses which had exponential dependencies.
All proofs are deferred to the Appendix.

Section: Related work
Particle methods and mean-field limits. Interacting particle systems have been studied for decades in various fields, see e.g. [Szn91; CD13; Lac18]. Their more recent connection with the standard training of 2NNs [NS17; SS20; RV22; MMN18] has suggested new settings of analysis, where convexity of the functional plays a key role, and has led to many developments. In particular, the case of MFLD (under study here) quickly progressed from nonquantitative guarantees [MMN18; HRŠS21], to mean-field convergence rates [NWS22; Chi22b] and fully discrete computational guarantees [CRW22; SWN23; KZCE+24] in the span of a few years. Recent progress also addresses its accelerated (underdamped) version [CLRW24; FW23], which could also be of interest in our setting.
Multiple kernel learning and bilevel training of NNs. The lifting reductions we consider are inspired by the unbalanced optimal transport literature [LMS18], while the bilevel reduction comes from the Multiple Kernel Learning (MKL) literature [CVBM02; LCBGJ04; RBCG08] (see [Bac19] for an account). While the latter is usually studied with a discrete domain $\mathcal{W}$ (see also [PP21; PP23] for recent computational considerations), it was suggested for the training of large-width 2NNs in [Bac21] and used in conjunction with MFLD in [TS24] (more details below). Relatedly, a recent line of work studies the (noiseless) training of 2NNs in a two-timescale regime, where the outer layer is trained at a much faster rate than the inner layer [BMZ23; MB23; BBP23]. This implicitly corresponds to optimizing the bilevel objective and leads to improved convergence guarantees.
The work that is closest to ours is [TS24], which considers the MFLD on a 2NN with weight decay where the outer layer is optimized at each step. They interpret the resulting dynamics as a kernel learning dynamics and study properties of the learnt kernel and its associated RKHS. While they do not formulate explicitly the problem Eq. (1.1), it can be shown that our approaches are equivalent when considering W = R d+1 in Eq. (1.2) (and adding an extra regularization). The details are given in Sec. A.2. Key advantages of our formulation with W = S d are that we cover the case of unbounded homogeneous activation functions (such as ReLU), and can obtain improved LSI.

Section: Background on guarantees for mean-field Langevin dynamics
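The MFLD is defined as the Wasserstein gradient flow $(\mu_t)_{t \in \mathbb{R}_+}$ in $\mathcal{P}(\Omega)$ of an objective of the form Eq. (1.3). It is characterized as the solution to the partial differential equation (PDE)
$$\partial_t \mu_t = \mathrm{div}(\mu_t \nabla F'[\mu_t]) + \beta^{-1} \Delta \mu_t, \qquad \mu_0 \in \mathcal{P}(\Omega), \tag{2.1}$$
where $F'[\mu] : \Omega \to \mathbb{R}$ is the first variation of $F$ at $\mu$ [San15, Sec. 7.2], defined by $\lim_{\epsilon \downarrow 0} \frac{1}{\epsilon}\big(F(\mu + \epsilon(\mu' - \mu)) - F(\mu)\big) = \int F'[\mu]\, d(\mu' - \mu)$ for any $\mu' \in \mathcal{P}(\Omega)$. This PDE corresponds to the mean-field limit ($N \to \infty$) of the noisy particle gradient flow $\omega_t \in \Omega^N$:
$$\forall i \le N, \quad d\omega^i_t = -N \nabla_{\omega^i_t} F^{(N)}(\omega^1_t, \dots, \omega^N_t)\, dt + \sqrt{2\beta^{-1}}\, dB^i_t, \qquad \omega^i_0 \overset{\text{i.i.d.}}{\sim} \mu_0,$$
where $F^{(N)}(\omega^1, \dots, \omega^N) = F\big(\frac{1}{N} \sum_{i=1}^N \delta_{\omega^i}\big)$ and the $B^i_t$ are $N$ independent Brownian motions on $\Omega$. The convergence guarantees for MFLD rely on three key properties:
(P0) (Convexity) $F$ is convex and is such that $F_\beta$ admits a minimizer $\mu^*_\beta$.
(P1) (Displacement smoothness) $F$ is $L$-displacement smooth, in the sense that
$$\forall \mu \in \mathcal{P}_2(\Omega),\ \forall \omega \in \Omega, \quad \max_{s \in T_\omega \Omega,\ \|s\|_\omega \le 1} \nabla^2 F'[\mu](s, s) \le L,$$
$$\text{and} \quad \forall \mu, \mu' \in \mathcal{P}_2(\Omega),\ \forall \omega \in \Omega, \quad \|\nabla F'[\mu] - \nabla F'[\mu']\|_\omega \le L\, W_2(\mu, \mu'),$$
where $\nabla^2$ denotes the Riemannian Hessian.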
(P2) (Uniform LSI) There exists $\alpha > 0$ such that $\forall t \ge 0$, $F_\beta$ satisfies local $\alpha$-LSI at $\mu_t$, as in Def. 2.1.
Definition 2.1 (Local LSI). We say that a functional $F_\beta = F + \beta^{-1} H$ satisfies local $\alpha$-LSI at $\mu \in \mathcal{P}(\Omega)$ if $Z := \int_\Omega \exp(-\beta F'[\mu])\, d\omega < \infty$ and the proximal Gibbs measure $\hat\mu := Z^{-1} \exp(-\beta F'[\mu]) \in \mathcal{P}(\Omega)$ satisfies $\alpha$-LSI, that is
$$\forall \mu' \in \mathcal{P}(\Omega), \quad H(\mu' | \hat\mu) \le \frac{1}{2\alpha} I(\mu' | \hat\mu),$$
where the relative entropy and relative Fisher Information are respectively defined as
$$H(\mu' | \hat\mu) := \int_\Omega \log \frac{d\mu'}{d\hat\mu}\, d\mu', \qquad I(\mu' | \hat\mu) := \int_\Omega \Big\| \nabla \log \frac{d\mu'}{d\hat\mu}(\omega) \Big\|_\omega^2\, d\mu'(\omega),$$
and $\|\cdot\|_\omega$ denotes the Riemannian metric.
We review some useful criteria for LSI in App. B. In particular, the uniform LSI property (P2) holds for example when training two-layer neural networks with a frozen second layer, under some technical assumptions such as bounded activation function. In fact in that case, the proximal Gibbs measures $\hat\mu$ even satisfy LSI uniformly for all $\mu \in \mathcal{P}(\Omega)$ [Chi22b; NWS22].
Note that the Riemannian gradient ∇ and the Laplace-Beltrami operator ∆ appearing in (2.1), as well as the definition of Brownian motion, depend on the Riemannian metric of Ω. This dependency is reflected in (P1) and (P2).
The following theorem guarantees the global convergence of MFLD at an explicit rate.
Theorem 2.1 ([Chi22b, Thm. 3.2][NWS22, Thm. 1]). Consider $F : \mathcal{P}(\Omega) \to \mathbb{R}$ and $(\mu_t)$ as in (2.1). If (P0), (P1) and (P2) are satisfied then for $t \ge 0$ it holds
$$\beta^{-1} H(\mu_t | \mu^*_\beta) \le F_\beta(\mu_t) - F_\beta(\mu^*_\beta) \le \exp(-2\beta^{-1} \alpha t)\, \big(F_\beta(\mu_0) - F_\beta(\mu^*_\beta)\big).$$
Note that although the L-smoothness constant does not appear in Thm. 2.1, it does appear in the discrete-time guarantees of [SWN23], and is thus an important quantity in practice. In this paper, we limit our analysis to the mean-field dynamics (2.1) because its time-discretization has not yet been studied on Riemannian manifolds. In continuous time, the proof of Thm. 2.1 translates directly to Riemannian manifolds thanks to our definition of (P1), see App. B.
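As an illustration of the noisy particle gradient flow above, here is a minimal Euler-Maruyama sketch in the flat case $\Omega = \mathbb{R}^d$ (not a Riemannian manifold). The toy functional $F(\mu) = \int \frac{1}{2}|\omega|^2 d\mu + \frac{1}{2}|\int \omega\, d\mu - m|^2$, the function names, and all numerical values are assumptions chosen for illustration; they do not come from the paper, and the sketch carries none of the discrete-time guarantees discussed above.

```python
import numpy as np

def grad_first_variation(particles, target):
    # Gradient of F'[mu] for the toy functional (an assumption, not from the paper):
    # F(mu) = int 0.5*|w|^2 dmu + 0.5*|mean(mu) - target|^2,
    # so grad F'[mu](w) = w + (mean(mu) - target).
    return particles + (particles.mean(axis=0) - target)

def simulate_mfld(n_particles=2000, dim=2, beta=10.0, step=0.01,
                  n_steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    target = np.ones(dim)
    particles = rng.normal(size=(n_particles, dim))  # i.i.d. draws from mu_0
    for _ in range(n_steps):
        drift = grad_first_variation(particles, target)
        noise = rng.normal(size=particles.shape)
        # Euler-Maruyama step for d w = -grad F'[mu](w) dt + sqrt(2/beta) dB
        particles = particles - step * drift + np.sqrt(2.0 * step / beta) * noise
    return particles, target

particles, target = simulate_mfld()
print("empirical mean:", particles.mean(axis=0))
print("empirical variance per coordinate:", particles.var(axis=0))
```

For this toy functional, the minimizer of $F_\beta$ is a Gaussian with mean $m/2$ and covariance $\beta^{-1} I$, so the printed statistics should be close to those values, up to sampling and time-discretization error.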

Section: Reductions from signed measures to probability measures
In order to apply the MFLD framework to solve our initial problem over signed measures (1.1), we must first recast it as an optimization problem over probability measures. In this section we build two such reductions, and discuss the properties (P0, P1 and P2) of the resulting problems.

Section: Reduction by lifting
Reductions by lifting consist in representing signed measures as projections of probability measures in the higher dimensional space Ω = R × W. This construction involves the 1-homogeneous projection operator where f (w) = ∥ν∥ T V dν d|ν| (w) (and only for this µ when b > 1). In particular, if G λ admits a minimizer then F λ,b does too, and it holds
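$$\min_{\mu \in \mathcal{P}_b(\Omega)} F_{\lambda,b}(\mu) = \min_{\nu \in \mathcal{M}(\mathcal{W})} G_\lambda(\nu).$$
It is not difficult to see that $F_{\lambda,b}$ satisfies (P0) as long as $G_\lambda$ admits a minimizer. In order to study (P1) and (P2), we need to define a Riemannian metric on $\Omega$. Following [Chi22c], we consider a general class of Riemannian metrics on $\Omega^* := \mathbb{R}^* \times \mathcal{W}$, parameterized by $q_r, q_w \in \mathbb{R}$ and $\Gamma > 0$, defined by
$$\left\langle \begin{pmatrix} \delta r_1 \\ \delta w_1 \end{pmatrix}, \begin{pmatrix} \delta r_2 \\ \delta w_2 \end{pmatrix} \right\rangle_{(r,w)} = \Gamma^{-1} |r|^{q_r} \frac{\delta r_1\, \delta r_2}{r^2} + |r|^{q_w} \langle \delta w_1, \delta w_2 \rangle_w. \tag{3.2}$$
This indeed defines an inner product on $T_{(r,w)} \Omega^* := \mathbb{R} \times T_w \mathcal{W}$ that varies smoothly, and so equips $\Omega^*$ with a (disconnected) Riemannian manifold structure [Lee18]. Intuitively, the parameter $\Gamma$ will govern the relative speed of the weight or position variables along gradient flows; larger $\Gamma$ means faster weight updates.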
Two particular cases of this construction appear (sometimes implicitly) in the literature on 2NN:
(i) when $q_r = 2$ and $q_w = 0$, the metric (3.2) extends to the product metric on $\Omega = \mathbb{R} \times \mathcal{W}$. With $\mathcal{W} = \mathbb{R}^{d+1}$, this corresponds to the usual parameterization of 2NNs and is the setting of most previous works applying MFLD to 2NN (with a weight decay regularization on the second layer for $b = 2$ and $\lambda > 0$).
(ii) when $q_r = q_w = 1$, $\Omega^*$ is isometric to the union of two copies of the (tipless) metric cone over $\mathcal{W}$ [BBI01] (via the mapping $(r, w) \mapsto (\mathrm{sign}(r), |r|, w)$). This is the natural setting for optimization over signed measures; and with $\mathcal{W} = \mathbb{S}^d$, it is equivalent to the parameterization of 2NNs with ReLU activation and balanced initialization [CB20, App. H].
Issues caused by the disconnectedness of $\Omega^*$. On the level of the equivalence of variational problems, one can check that the statement of Prop. 3.1 also holds if $\Omega = \mathbb{R} \times \mathcal{W}$ is replaced by $\Omega^* = \mathbb{R}^* \times \mathcal{W}$. However, when the manifold $\Omega^*$ is truly disconnected, then $\mathcal{P}(\Omega)$ is not connected in the sense of absolutely continuous curves in Wasserstein space. More precisely, $\Omega^*$ is the disjoint union of $\Omega^*_+ = \mathbb{R}^*_+ \times \mathcal{W}$ and $\Omega^*_- = \mathbb{R}^*_- \times \mathcal{W}$, and one can show that (for certain choices of $q_r, q_w$), if $(\mu_t)_t$ is a Wasserstein gradient flow (or any other absolutely continuous curve), then $\mu_t(\Omega^*_+) = \mu_0(\Omega^*_+)$ for all $t$. Moreover, supposing for simplicity that $G_\lambda$ has a unique minimizer $\nu$ and that $b > 1$, then $F_{\lambda,b}$ has a unique minimizer $\mu^*$, and $\mu^*(\Omega^*_+) = \nu_+(\mathcal{W}) / \|\nu\|_{TV}$ where $\nu = \nu_+ - \nu_-$ is the Jordan decomposition of $\nu$. Therefore, Wasserstein gradient flow for $F_{\lambda,b}$ can only converge to $\mu^*$ if it was initialized such that $\mu_0(\Omega^*_+) = \mu^*(\Omega^*_+)$. In terms of particle methods, this means that the fraction of the particles $(r_i, w_i)$ initialized with $r_i > 0$ must be precisely $\mu^*(\Omega^*_+)$. A similar problem arises if we apply MFLD to $F_{\lambda,b}$, since it is nothing else than Wasserstein gradient flow for $F_{\lambda,b} + \beta^{-1} H$; but it is more tedious to discuss formally, as $F_{\lambda,b} + \beta^{-1} H$ does not have a minimizer in general.
In order to bypass this limitation, one may focus on settings where the ratio $\nu_+(\mathcal{W}) / \|\nu\|_{TV}$ for the optimal $\nu$ is known in advance, e.g., the problem (1.1) constrained to non-negative measures, or on choices of $q_r, q_w$ for which $\Omega^*$ can be extended into a connected manifold, such as the product metric $q_r = 2$, $q_w = 0$. However, even in those cases, MFLD on $F_{\lambda,b}$ presents other limitations.
Incompatibility with MFLD. We now show that, in spite of the degrees of freedom given by the parameters $q_r$, $q_w$ and $b$, satisfying both (P1) and (P2) requires restrictive assumptions. This suggests that the lifting approach is fundamentally incompatible with MFLD.
Proposition 3.2. Consider $F_{\lambda,b}$ from Eq. (3.1) and $\Omega^*$ equipped with the metric (3.2). Suppose $G'[\nu]$ is continuous for all $\nu$ and that there exists $\nu$ such that $\nabla^2 G'[\nu]$ is not constant equal to 0. Then
• If $q_r \ne 1$ or $q_w \ne 1$ or $b \ne 1$, then (P1) does not hold.
• If $q_r = q_w = b = 1$, then for any $\mu \in \mathcal{P}_1(\Omega)$, there exists $\lambda_0 > 0$ such that $F_{\lambda,b} + \beta^{-1} H$ does not satisfy local LSI at $\mu$ for any $\lambda < \lambda_0$ (in particular (P2) does not hold unless $\lambda$ is large enough).
When q r = q w = b = 1 and λ is large enough, then it can indeed be shown that Thm. 2.1 applies under natural conditions, see for instance 2), need to be adapted). Note that the problem considered in [Chi22c] is of the form G(ν) + λ ∥ν∥ T V , and they analyzed Wasserstein gradient flow on F λ,1,1 with q r = q w = 1 (in particular the issues caused by the disconnectedness of Ω * are bypassed thanks to the choice b = 1). The above discussion shows that applying MFLD to that problem would only yield convergence guarantees for λ large enough.

Section: Reduction by bilevel optimization
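We define the bilevel objective functional $J_\lambda$ for $\eta \in \mathcal{P}(\mathcal{W})$ as
$$J_\lambda(\eta) := \inf_{\nu \in \mathcal{M}(\mathcal{W})} G(\nu) + \frac{\lambda}{2} \int_{\mathcal{W}} \frac{|\nu|^2}{\eta}, \tag{3.3}$$
where $\int_{\mathcal{W}} \frac{|\nu|^2}{\eta} := \int_{\mathcal{W}} \big(\frac{d\nu}{d\eta}\big)^2 d\eta$ if $\nu$ is absolutely continuous with respect to $\eta$, and $+\infty$ otherwise. By the Cauchy-Schwarz inequality, $\|\nu\|_{TV}^2 = \min_{\eta \in \mathcal{P}(\mathcal{W})} \int_{\mathcal{W}} \frac{|\nu|^2}{\eta}$, with the minimum attained at $\eta = |\nu| / \|\nu\|_{TV}$ when $\nu \ne 0$, so exchanging the two infima gives $\inf_{\nu \in \mathcal{M}(\mathcal{W})} G_\lambda(\nu) = \inf_{\eta \in \mathcal{P}(\mathcal{W})} J_\lambda(\eta)$. Moreover, the objective minimized in (3.3) is jointly convex in $(\eta, \nu)$ and partial minimization preserves convexity, so $J_\lambda$ is convex. Let us gather these crucial remarks in a formal statement.
Proposition 3.3. The bilevel objective $J_\lambda$ is convex and $\inf_{\mathcal{P}(\mathcal{W})} J_\lambda = \inf_{\mathcal{M}(\mathcal{W})} G_\lambda$. Moreover, if $G_\lambda$ admits a minimizer $\nu \in \mathcal{M}(\mathcal{W})$, then $\arg\min J_\lambda = \big\{ \frac{|\nu|}{\|\nu\|_{TV}} : \nu \in \arg\min G_\lambda \big\}$.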
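The variational representation of the squared total variation norm that underlies the bilevel reduction can be checked numerically on a finitely supported measure. The sketch below is only an illustration: the discrete setting, the sample size, and the variable names are assumptions, and the random search over $\eta$ is a sanity check rather than a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
nu = rng.normal(size=50)      # a signed measure supported on 50 atoms
tv_norm = np.abs(nu).sum()    # total variation norm of nu

def quadratic_over_eta(eta):
    # Discrete analogue of int |nu|^2 / eta for a strictly positive eta
    return np.sum(nu**2 / eta)

# Candidate minimizer: eta* = |nu| / ||nu||_TV
eta_star = np.abs(nu) / tv_norm
value_at_star = quadratic_over_eta(eta_star)

# Compare against random probability vectors drawn from a Dirichlet law
others = rng.dirichlet(np.ones(nu.size), size=1000)
min_other = min(quadratic_over_eta(eta) for eta in others)

print(value_at_star, tv_norm**2, min_other)
```

The first two printed numbers coincide, and the third is never smaller, in line with the Cauchy-Schwarz inequality.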
Link between the lifted and bilevel reductions. The equality case in the statement of Prop. 3.1 shows that we can restrict the lifted reduction to measures $\mu \in \mathcal{P}_b(\Omega)$ of the form $\mu(dr, dw) = \delta_{f(w)}(dr)\, \eta(dw)$ for some $f : \mathcal{W} \to \mathbb{R}$ and $\eta \in \mathcal{P}(\mathcal{W})$. Since they satisfy $h\mu(dw) = f(w)\, \eta(dw)$, the lifted reduction with $b = 2$ thus rewrites
$$\min_{\eta \in \mathcal{P}(\mathcal{W})} \min_{f \in L^2(\eta)} G(f\eta) + \frac{\lambda}{2} \int_{\mathcal{W}} f(w)^2\, d\eta(w).$$
After the change of variable $(\nu, \eta) = (f\eta, \eta)$, the outer objective is precisely $J_\lambda(\eta)$. Thus, Wasserstein gradient flow on $J_\lambda$ can be seen as a two-timescale optimization dynamics: it is the Wasserstein gradient flow on $F_{\lambda,2}$ in the limit where $\Gamma \to \infty$. In the context of 2NN training with the parametrization (i), this amounts to training the output layer infinitely faster than the input layer, as done in [BMZ23; MB23; BBP23; TS24]. This remark makes it possible to implement the bilevel MFLD numerically by discretizing in time the following system of SDEs, for fixed large $N$ and $\Gamma$: for all $i \le N$,
$$dr^i_t = -\Gamma\, \nabla_{r^i} F'_{\lambda,2}[\mu_t](r^i_t, w^i_t)\, dt = -\Gamma \big( G'[\nu_t](w^i_t) + \lambda r^i_t \big)\, dt,$$
$$dw^i_t = -\nabla_{w^i} F'_{\lambda,2}[\mu_t](r^i_t, w^i_t)\, dt + \sqrt{2\beta^{-1}}\, dB^i_t = -r^i_t\, \nabla G'[\nu_t](w^i_t)\, dt + \sqrt{2\beta^{-1}}\, dB^i_t, \tag{3.4}$$
where $\mu_t = \frac{1}{N} \sum_{i=1}^N \delta_{(r^i_t, w^i_t)}$ and $\nu_t = \frac{1}{N} \sum_{i=1}^N r^i_t \delta_{w^i_t}$, and taking $\eta_t = \frac{1}{N} \sum_{i=1}^N \delta_{w^i_t}$.
Notice the absence of a noise term on the weight variables $r$; it reflects the fact that MFLD for the bilevel objective is not a limit case of MFLD for the lifted objective, as the noise would prevent the inner problem from reaching optimality.
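A time discretization of the system (3.4) can be sketched as follows for least-squares regression with a 2NN on the sphere. This is a minimal illustration under several assumptions that are not taken from the paper: the tanh activation, the synthetic single-neuron data, the explicit Euler scheme, the approximation of spherical Brownian motion by a projected Gaussian step followed by renormalization, and all parameter values and names.

```python
import numpy as np

def bilevel_mfld_sketch(X, y, n_particles=100, lam=0.1, beta=200.0,
                        gamma=20.0, step=0.01, n_steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    n, dim = X.shape
    W = rng.normal(size=(n_particles, dim))
    W /= np.linalg.norm(W, axis=1, keepdims=True)   # positions w_i on the sphere
    r = np.zeros(n_particles)                        # signed weights r_i

    def objective(W, r):
        # Lifted objective with b = 2: G(nu) + (lam/2) * mean(r_i^2)
        residual = np.tanh(X @ W.T) @ r / n_particles - y
        return 0.5 * np.mean(residual**2) + 0.5 * lam * np.mean(r**2)

    history = [objective(W, r)]
    for _ in range(n_steps):
        feats = np.tanh(X @ W.T)                     # shape (n, N)
        residual = feats @ r / n_particles - y       # prediction error, shape (n,)
        g = feats.T @ residual / n                   # G'[nu](w_i), shape (N,)
        # Euclidean gradient of G'[nu] at each w_i, shape (N, dim)
        grad_g = ((1.0 - feats**2) * residual[:, None]).T @ X / n
        # Fast, noiseless update of the weights r_i (rate gamma)
        r_new = r - step * gamma * (g + lam * r)
        # Noisy update of the positions w_i, projected on the tangent space
        move = (-step * r[:, None] * grad_g
                + np.sqrt(2.0 * step / beta) * rng.normal(size=W.shape))
        move -= np.sum(move * W, axis=1, keepdims=True) * W
        W = W + move
        W /= np.linalg.norm(W, axis=1, keepdims=True)  # retraction to the sphere
        r = r_new
        history.append(objective(W, r))
    return W, r, np.array(history)

# Synthetic single-neuron teacher (hypothetical data, for illustration only)
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(200, 5))
y = np.tanh(X[:, 0])
W, r, history = bilevel_mfld_sketch(X, y)
print("objective at start:", history[0], "objective at end:", history[-1])
```

The product `step * gamma` has to stay small for the explicit update of `r` to be stable, which is one practical cost of taking $\Gamma$ large.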
Compatibility with MFLD. We now show that, in contrast to the lifting reduction, the bilevel reduction is amenable to MFLD. The main assumption on (1.1) is as follows.
Assumption 1. $G : \mathcal{M}(\mathcal{W}) \to \mathbb{R}$ is non-negative and admits second variations, and for each $i \in \{0, 1, 2\}$, there exist $L_i, B_i < \infty$ such that $\|\nabla^i G''[\nu](w, w')\|_w \le L_i$ and $\|\nabla^i G'[\nu]\|_w \le L_i \|\nu\|_{TV} + B_i$ for all $\nu \in \mathcal{M}(\mathcal{W})$ and $w, w' \in \mathcal{W}$. Moreover there exists $L_2 < \infty$ such that $\|\nabla_w \nabla_{w'} G''[\nu](w, w')\| \le L_2$ for all $\nu, w, w'$. Furthermore, $\mathcal{W}$ is compact and the uniform probability measure $\tau$ on $\mathcal{W}$ satisfies LSI with constant $\alpha_\tau$.
Concrete settings that satisfy Assumption 1 are discussed in Sec. 5. The following proposition confirms the compatibility with MFLD and gives quantitative bounds on the LSI constant.
Proposition 3.4. Under Assumption 1, $J_\lambda$ satisfies (P0), (P1) and (P2). More precisely, for any $\eta \in \mathcal{P}(\mathcal{W})$, $J_\lambda + \beta^{-1} H$ satisfies local LSI at $\eta$ with the constant $\alpha_\eta = \alpha_\tau \exp\big(-\frac{1}{\lambda} L_0 \beta J_\lambda(\eta)\big)$. Further, $J_\lambda + \beta^{-1} H$ satisfies $\alpha$-LSI uniformly along the MFLD trajectory $(\eta_t)_t$ with the constant $\alpha = \alpha_\tau \exp\big(-\frac{1}{\lambda} L_0 \beta \min\{G(0),\ J_\lambda(\eta_0) + \beta^{-1} H(\eta_0 | \tau)\}\big)$.
In view of the negative result of Prop. 3.2 for the lifting reduction, and the positive result of Prop. 3.4 for the bilevel reduction, in the sequel we focus on MFLD applied on J λ , which we will refer to as MFLD-Bilevel.
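To get a feel for the bound of Prop. 3.4, the following sketch evaluates the uniform LSI constant and the time that Thm. 2.1 then guarantees for shrinking the optimality gap by a given factor. The function names and the numerical values are hypothetical; only the two formulas come from the statements above.

```python
import math

def uniform_lsi_bound(alpha_tau, L0, lam, beta, G_at_zero, J_init, entropy_init):
    # alpha = alpha_tau * exp(-(L0*beta/lam) * min{G(0), J(eta_0) + H(eta_0|tau)/beta})
    energy = min(G_at_zero, J_init + entropy_init / beta)
    return alpha_tau * math.exp(-(L0 * beta / lam) * energy)

def time_to_shrink_gap(alpha, beta, factor):
    # Smallest t such that exp(-2*alpha*t/beta) <= factor, as in Thm. 2.1
    return beta * math.log(1.0 / factor) / (2.0 * alpha)

# Hypothetical constants: the guaranteed time grows very quickly with beta
for beta in (1.0, 5.0, 10.0):
    alpha = uniform_lsi_bound(alpha_tau=1.0, L0=1.0, lam=0.5, beta=beta,
                              G_at_zero=1.0, J_init=2.0, entropy_init=0.0)
    print(beta, alpha, time_to_shrink_gap(alpha, beta, factor=0.01))
```

The exponential dependence of the bound on $\beta / \lambda$ is what motivates the annealing schedules studied next.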

Section: Global convergence and annealing for MFLD-Bilevel
While the bounds from Prop. 3.4 along with Thm. 2.1 make it possible to establish global convergence to minimizers of $J_\lambda + \beta^{-1} H$, our aim is to minimize the unregularized bilevel objective $J_\lambda$. This can be achieved by annealing the temperature parameter $\beta^{-1}$ along the dynamics. Namely, Theorem 4.1 of [Chi22b] guarantees that by choosing $\beta_t = c \log(t)$ for an appropriate constant $c$, the annealed MFLD trajectory
$$\partial_t \eta_t = \mathrm{div}(\eta_t \nabla J'_\lambda[\eta_t]) + \beta_t^{-1} \Delta \eta_t$$
satisfies $J_\lambda(\eta_t) - \inf J_\lambda = O\big(\frac{\log\log t}{\log t}\big)$. This is a very slow rate, however.
In this section, we show that the structure of J λ originating from the bilevel reduction can be exploited to go beyond the generic guarantees from [Chi22b,Thm. 4.1]. Namely, we study in detail an alternative temperature annealing strategy, and we show that it improves upon the classical one β t ∼ log(t) in terms of convergence to a fixed multiplicative accuracy.

Section: Faster convergence to a fixed multiplicative accuracy
Definition 4.1. Suppose $0 \notin \arg\min G$, so that $J^*_\lambda := \inf J_\lambda > 0$. We will say that MFLD-Bilevel with a given temperature annealing schedule $(\beta_t)_{t \ge 0}$ converges to $(1 + \Delta)$-multiplicative accuracy in time-complexity $T_\Delta$, for a fixed positive constant $\Delta$ (say $\Delta = 0.01$), if $J_\lambda(\eta_{T_\Delta}) \le (1 + \Delta) J^*_\lambda$.
Note that in machine learning settings where the problem (1.1) corresponds to learning with overparameterized models, it is realistic to assume $J^*_\lambda$ to be small (as long as the regularization $\lambda$ is small), and $T_\Delta$ is the time it takes for the annealed MFLD to achieve a suboptimality of at most $\Delta J^*_\lambda$. For ease of comparison, let us report the time-complexity $T_\Delta$ that can be achieved by simply running MFLD-Bilevel with a constant but well-chosen $\beta$, based on the bounds from Prop. 3.4 and Thm. 2.1.
Proposition 4.1 (Baseline "annealing" schedule: constant $\beta_t$). Under Assumption 1, let $\Delta > 0$ and assume that $\Delta \le \frac{L_0 L_1 G(0)}{\lambda^2 J^*_\lambda}$. Then, MFLD-Bilevel with the temperature schedule
$$\forall t, \quad \beta_t = \frac{4d}{\Delta J^*_\lambda} \log \frac{CB}{\Delta J^*_\lambda}$$
converges to $(1 + \Delta)$-multiplicative accuracy in time
$$T_\Delta \le \frac{C'}{\Delta J^*_\lambda} \log\Big(\frac{CB}{\Delta J^*_\lambda}\Big) \cdot \exp\Big(\frac{C' L_0 G(0)}{\lambda \Delta J^*_\lambda} \log \frac{CB}{\Delta J^*_\lambda}\Big) \cdot \Big(\log \frac{2 G(0)}{\Delta J^*_\lambda} + C' H(\eta_0 | \tau)\Big),$$
where $B = \mathrm{poly}(L_0, L_1, B_1, G(0), \lambda^{-1})$ and $C, C'$ are constants dependent on $\mathcal{W}$ (and $d$ and $\alpha_\tau$).
For the annealing schedule $\beta_t \sim \log(t)$, the time-complexity $T_\Delta$ that can be guaranteed from inspecting the proof of [Chi22b, Thm. 4.1] has the same dependency on $d$, $\lambda$ and $J^*_\lambda$ as for the baseline $\beta_t = \mathrm{cst}$.
Improved annealing schedule. Recall the result of Prop. 3.4: for any $\beta > 0$, $J_\lambda + \beta^{-1} H$ satisfies local $\alpha_\eta$-LSI at $\eta$ with $\alpha_\eta = \alpha_\tau \exp\big(-\frac{L_0}{\lambda} \beta J_\lambda(\eta)\big)$. Informally, if we manage to control $J_\lambda(\eta_t)$ along the annealed MFLD trajectory and show that it decreases, then we can increase $\beta_t$ at the same rate, while retaining the same local LSI constant. This observation and the resulting annealing procedure were introduced in [SWON23], in a 2NN classification setting with the logistic loss. There the optimal value of the loss functional, corresponding to our $J^*_\lambda$, is 0, and the annealing procedure yields favorable rates for global convergence. Here we show that this procedure is also applicable for MFLD-Bilevel, as soon as $G$ satisfies the mild Assumption 1, yielding favorable rates for convergence to a fixed multiplicative accuracy.
Theorem 4.2. Under Assumption 1, there exist constants $B = \mathrm{poly}(L_i, B_i, G(0), \lambda^{-1})$ and $C_i$ dependent only on $G(0)$, $H(\eta_0)$, $\mathcal{W}$ (and $d$ and $\alpha_\tau$) such that the following holds. For any $\Delta \le B / J^*_\lambda$, MFLD-Bilevel with the temperature schedule $(\beta_t)_{t \ge 0}$ defined by $\forall k \le K$, $\forall t \in [t_k, t_{k+1}]$, $\beta_t = 2^k d$, where $t_0 = 0$ and $K = \lceil 2 \log_2(B / (\Delta J^*_\lambda)) \rceil$ and
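$$t_{k+1} - t_k = C_1\, 2^k k \cdot \exp\Big(\frac{L_0 d}{\lambda} \Big(\frac{C_3}{\Delta} \log \frac{B}{\Delta J^*_\lambda} + C_2\Big)\Big),$$
achieves $(1 + \Delta)$-multiplicative accuracy, with time-complexity
$$T_\Delta \le t_{K+1} \le \frac{C_4}{\Delta J^*_\lambda} \Big(\log \frac{B}{\Delta J^*_\lambda}\Big)^2 \cdot \exp\Big(\frac{L_0 d}{\lambda} \Big(\frac{C_3}{\Delta} \log \frac{B}{\Delta J^*_\lambda} + C_2\Big)\Big).$$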
Note that if $G$ admits a minimizer $\nu_0$ and $\min G = 0$, as is typically the case in overparametrized machine learning settings, then by the envelope theorem $J^*_\lambda = \inf \big(G + \frac{\lambda}{2} \|\cdot\|_{TV}^2\big) = \frac{\|\nu_0\|_{TV}^2}{2} \lambda + o(\lambda)$. So in the regime of small $\lambda$, ignoring the subexponential factors, the time complexity bound achieved by the annealing schedule of Thm. 4.2 scales as $\exp(c \lambda^{-1} \log \lambda^{-1})$ for a constant $c$. This improves upon the time complexity bound of the classical annealing procedure $\beta_t \sim \log(t)$ (the same as in Prop. 4.1), which scales as $\exp(c' \lambda^{-2})$.
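The informal observation behind the improved schedule, namely that $\beta$ can be doubled whenever the objective has halved without degrading the local LSI bound of Prop. 3.4, can be illustrated with a few lines of arithmetic. The numbers below are hypothetical, and the halving of the objective at every stage is an idealization: in the actual dynamics $J_\lambda(\eta_t)$ cannot fall below $J^*_\lambda$, which is why the schedule stops after finitely many stages.

```python
import math

def local_lsi(alpha_tau, L0, lam, beta, J_value):
    # Local LSI bound of Prop. 3.4: alpha_tau * exp(-(L0/lam) * beta * J(eta))
    return alpha_tau * math.exp(-(L0 / lam) * beta * J_value)

alpha_tau, L0, lam, d = 1.0, 1.0, 0.5, 3        # hypothetical constants
n_stages = 6
J_values = [1.0 / 2**k for k in range(n_stages)]  # idealized: objective halves each stage
betas = [2**k * d for k in range(n_stages)]       # stage-wise schedule beta = 2^k d

# LSI bound seen at each stage of the doubling schedule
stagewise = [local_lsi(alpha_tau, L0, lam, b, J) for b, J in zip(betas, J_values)]
# LSI bound when the final beta is used from the start, at the initial objective value
constant = local_lsi(alpha_tau, L0, lam, betas[-1], J_values[0])

print("stage-wise bounds:", stagewise)
print("constant-beta bound at initialization:", constant)
```

Along the doubling schedule the product $\beta J_\lambda(\eta)$ is unchanged, so the bound stays constant, whereas starting directly at the final $\beta$ gives a bound that is smaller by many orders of magnitude.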

Section: Local LSI constant at optimality for learning a single neuron
Devising temperature annealing schemes for global convergence, as illustrated in the previous section, relies on bounds on the local LSI constant at every iterate η t of the (annealed) MFLD. Such bounds are readily provided by the widely applicable Holley-Stroock perturbation argument, on which for example our Prop. 3.4 is based, but may be overly pessimistic. Indeed in this section, we demonstrate that for MFLD-Bilevel, the LSI constant at convergence can be independent of β, λ and d, instead of exponential in β as a global analysis would suggest.
More precisely, we are interested in $\alpha^*$, the best local LSI constant of $J_{\lambda,\beta} := J_\lambda + \beta^{-1} H(\cdot | \tau)$, at $\eta_{\lambda,\beta} := \arg\min J_{\lambda,\beta}$. In fact the proximal Gibbs measure of the optimum is the optimum itself: $\hat\eta_{\lambda,\beta} = \eta_{\lambda,\beta}$, so $\alpha^*$ is precisely the LSI constant of $\eta_{\lambda,\beta}$. A bound on $\alpha^*$ is of interest, especially in the regime of large $\beta$ (low entropic regularization), for two reasons. Firstly, it directly implies a local convergence bound on MFLD-Bilevel, as shown in the proposition below. Secondly, characterizing the dependency of $\alpha^*$ on $\beta$ may open the way to more efficient temperature annealing strategies; but this is out of the scope of this paper.
Proposition 5.1. Under Assumption 1, suppose $\eta_{\lambda,\beta}$ satisfies LSI with some constant $\alpha^*_\beta$. For any $\varepsilon > 0$, there exists a sublevel set of $J_{\lambda,\beta}$ such that, for any initialization $\eta_0$ in this sublevel set,
$$J_{\lambda,\beta}(\eta_t) - \inf J_{\lambda,\beta} \le \big(J_{\lambda,\beta}(\eta_0) - \inf J_{\lambda,\beta}\big)\, e^{-(\alpha^*_\beta \beta^{-1} - \varepsilon) t}.$$
[Fig. 1 caption (fragment): "Conic" refers to using the metric (3.2) with $q_r = 1$, $q_w = 1$, while "Canonical" refers to the choice of $q_r = 2$, $q_w = 0$.]
For the local LSI analysis, we focus on a specific setting of (1.1), namely, least-squares regression using a 2NN with a normalization constraint on the first-layer weights, and a single-neuron teacher network. See Fig. 1 for an illustrative numerical experiment. Note that Assumption 2, with additional bounded-moment assumptions on $\varphi$ and $\rho$, is a special case of Assumption 1, as shown in Prop. F.4.
Assumption 2. $\mathcal{W} = \mathbb{S}^d$ is the Euclidean sphere in $\mathbb{R}^{d+1}$ and there exist $\rho$ a covariate distribution over $\mathbb{R}^{d+1}$, $y \in L^2_\rho(\mathbb{R}^{d+1})$ a fixed target function, and $\varphi : \mathbb{R} \to \mathbb{R}$ a $C^2$ activation function such that
$$G(\nu) = \frac{1}{2} \mathbb{E}_{x \sim \rho} |\hat y_\nu(x) - y(x)|^2 \quad \text{where} \quad \hat y_\nu(x) = \int_{\mathcal{W}} \varphi(\langle w, x\rangle)\, d\nu(w).$$
Under the above assumption, we show in Prop. F.1 a simplified expression for the bilevel objective and its first variation,
$$J_\lambda(\eta) = \frac{\lambda}{2} \langle y, (K_\eta + \lambda\, \mathrm{id})^{-1} y \rangle_{L^2_\rho}, \qquad J'_\lambda[\eta](w) = -\frac{\lambda}{2} \langle \varphi(\langle w, \cdot\rangle), (K_\eta + \lambda\, \mathrm{id})^{-1} y \rangle^2_{L^2_\rho},$$
where $K_\eta$ is the integral operator in $L^2_\rho$ of the kernel $k_\eta(x, x') = \int \varphi(\langle w, x\rangle) \varphi(\langle w, x'\rangle)\, d\eta(w)$ and $\mathrm{id}$ is the identity operator on $L^2_\rho$. Additionally, we make the following assumption on the data distribution $\rho$ and on the response $y$.
Assumption 3. $\rho$ is rotationally invariant and the labels come from a single-index model: $y = \varphi(\langle v, x\rangle)$ for some fixed $v \in \mathcal{W}$.
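The closed-form expression for $J_\lambda$ and its first variation can be sanity-checked on an empirical instance, where $\rho$ is the uniform measure on $n$ samples and $\eta$ is the uniform measure on $N$ atoms of the sphere. In this sketch the tanh activation, the sizes, and the variable names are assumptions; the comparison is between the kernel formula and a direct solve of the inner minimization of (3.3) over densities $f = d\nu/d\eta$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, N, lam = 40, 3, 25, 0.1
X = rng.normal(size=(n, dim))                   # samples defining the empirical rho
W = rng.normal(size=(N, dim))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # atoms of eta on the sphere
y = np.tanh(X[:, 0])                            # single-index labels with v = e_1
Phi = np.tanh(X @ W.T)                          # Phi[j, i] = phi(<w_i, x_j>)

# Closed form: J = (lam/2) <y, (K_eta + lam id)^{-1} y> in L^2(rho)
K = Phi @ Phi.T / (n * N)                       # matrix of K_eta acting on L^2(rho)
z = np.linalg.solve(K + lam * np.eye(n), y)     # (K_eta + lam id)^{-1} y
J_closed = 0.5 * lam * (y @ z) / n

# Direct inner minimization over f: a ridge regression in the N coefficients
A = Phi.T @ Phi / (n * N) + lam * np.eye(N)
f = np.linalg.solve(A, Phi.T @ y / n)
residual = Phi @ f / N - y
J_direct = 0.5 * np.mean(residual**2) + 0.5 * lam * np.mean(f**2)

# First variation at the atoms: -(lam/2) <phi(<w_i, .>), (K_eta + lam id)^{-1} y>^2
first_variation = -0.5 * lam * (Phi.T @ z / n) ** 2

print(J_closed, J_direct)
```

The two values agree up to numerical precision, and the first variation at each atom equals $-\frac{\lambda}{2} f(w_i)^2$ for the optimal density $f$, as expected from the envelope theorem.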
With the above assumptions, we can state the main theorem of this section.
Theorem 5.2. Under Assumptions 2 and 3, there exists a function $g : [-1, +1] \to \mathbb{R}_+$ such that $J'_\lambda[\delta_v](w) = -\lambda g(\langle w, v\rangle)$ for any $w \in \mathbb{S}^d$. Suppose that $\lambda \le 1$ and that there exist constants $c_i, C_i > 0$ such that for all $r \in [-1, +1]$,
$$c_1 \le g'(r) \le C_1, \qquad g''(r) \ge -C_2, \qquad g''(r)(1 - r^2)^{1/2} \le C_3, \qquad g'''(r)(1 - r^2)^{3/2} \le C_4.$$
Then there exist constants $\alpha_v, D_0$ (dependent only on the $c_i, C_i$) such that for any $\beta \ge D_0 d \lambda^{-1}$, $\widehat{\delta_v} \propto e^{-\beta J'_\lambda[\delta_v]} \tau$ satisfies $\alpha_v$-LSI. Furthermore, if additionally $\frac{1}{d^2} \mathbb{E}_{x \sim \rho} \|x\|^4,\ \|\varphi^{(i)}\|_{L^4(\rho)} < \infty$ for $i \in \{0, 1, 2\}$, where $\|\varphi\|^p_{L^p(\rho)} := \int |\varphi(\langle w, x\rangle)|^p\, d\rho(x)$ (independent of $w$ as $\rho$ is rotationally invariant), then there exists a constant $\alpha^*$ dependent only on those constants and on the $c_i, C_i$ such that, provided that $\beta \ge \mathrm{poly}(d, \lambda^{-1})$, $\eta_{\lambda,\beta}$ satisfies $\alpha^*$-LSI.
The proof is based on the observation that $\eta_{\lambda,\beta} \approx \arg\min J_\lambda = \delta_v$, the Dirac measure at $v$, for certain regimes of $\beta$ and $\lambda$, in the Wasserstein metric. Thus we show that $J'_\lambda[\delta_v]$ is amenable to a Lyapunov type argument inspired by [MS14; LE23], and then transfer its properties to $J'_\lambda[\eta_{\lambda,\beta}]$.
We now verify the assumptions of Thm. 5.2 for a class of smooth, non-negative, and monotone activations which includes some popular practical choices such as the Softplus φ(z) = ln(1 + e z ) and sigmoid φ(z) = 1/(1 + e -z ). While we only consider smooth activations here for simplicity, certain non-smooth activations such as a leaky version of ReLU can also satisfy the conditions of Thm. 5.2.
Proposition 5.3. Suppose Assumptions 2 and 3 hold, and $b_1 (d + 1) \le \mathbb{E}[\|x\|^2] \le \mathbb{E}[\|x\|^{12}]^{1/6} \le b_2 (d + 1)$ for constants $b_1, b_2 > 0$. Let $m := 2 b_2^{3/2} / b_1$. Suppose $\varphi$ and $\varphi'$ are non-negative, $\inf_{|z| \le m} \varphi(z) \wedge \varphi'(z) > 0$ and $\|\varphi^{(i)}\|_{L^4(\rho)} < \infty$ for $i \le 3$. Then, $\varphi$ satisfies the assumptions of Thm. 5.2 with constants that only depend on $b_1$, $b_2$, and $\varphi$.

Section: Conclusion
In this paper, we investigated how mean-field Langevin dynamics (MFLD), an optimization dynamics over probability measures with global convergence guarantees, can be leveraged to solve convex optimization problems over signed measures of the form (1.1). For a large class of objectives G, we highlighted that MFLD with a lifting approach necessarily runs into some issues, whereas the bilevel approach always inherits the guarantees of MFLD, leading to convergence guarantees for G λ via annealing. Finally, turning to a 2-layer NN learning task which can be stated as an instance of (1.1), we showed that the local LSI constant of MFLD-Bilevel can scale much more favorably with d and β than a generic analysis would suggest.
Another approach to tackle (1.1) could be to build noisy particle dynamics directly in the space of signed measures, complementing the MFLD updates with, for instance, a birth-death process. A challenge then is to build such dynamics that can be efficiently discretized. It is also an interesting question for future works to find other settings to which MFLD can be extended, beyond signed measures.
Section: A Details for Sec. 1 (introduction)

Section: A.1 Using $\|\cdot\|_{TV}^2$ vs. $\|\cdot\|_{TV}$ as the regularization term
The optimization problems we consider in this paper are of the form (1.1), that is, for ease of reference,
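$$\min_{\nu \in \mathcal{M}(\mathcal{W})} G_\lambda(\nu), \qquad G_\lambda(\nu) := G(\nu) + \frac{\lambda}{2} \|\nu\|_{TV}^2.$$
Note the regularization term $\frac{\lambda}{2} \|\nu\|_{TV}^2$. This is to be contrasted with the more usual form of optimization problems
$$\min_{\nu \in \mathcal{M}(\mathcal{W})} \tilde G_{\tilde\lambda}(\nu), \qquad \tilde G_{\tilde\lambda}(\nu) := G(\nu) + \tilde\lambda \|\nu\|_{TV},$$
which uses $\|\nu\|_{TV}$ as the regularization.
On the level of variational problems, these two classes of problems are equivalent, in the sense that
$$\{0\} \cup \bigcup_{\lambda \ge 0} \arg\min G_\lambda = \{0\} \cup \bigcup_{\tilde\lambda \ge 0} \arg\min \tilde G_{\tilde\lambda},$$
where "0" refers to the zero measure on $\mathcal{W}$. Indeed, note that by convexity, the argmins are determined by the respective first-order optimality conditions, so that
$$\bigcup_{\lambda \ge 0} \arg\min G_\lambda = \Big\{ \nu \in \mathcal{M}(\mathcal{W});\ \forall w,\ G'[\nu](w) + \lambda \|\nu\|_{TV} \frac{\nu(dw)}{|\nu(dw)|} = 0,\ \lambda \in \mathbb{R}_+ \Big\},$$
$$\bigcup_{\tilde\lambda \ge 0} \arg\min \tilde G_{\tilde\lambda} = \Big\{ \nu \in \mathcal{M}(\mathcal{W});\ \forall w,\ G'[\nu](w) + \tilde\lambda \frac{\nu(dw)}{|\nu(dw)|} = 0,\ \tilde\lambda \in \mathbb{R}_+ \Big\}.$$
To see that the set on the first line is contained in the second, let $\nu \in \arg\min G_\lambda$; then $\nu$ satisfies the first-order optimality condition for $\tilde G_{\tilde\lambda}$ with $\tilde\lambda = \lambda \|\nu\|_{TV}$. Conversely, if $\nu \in \arg\min \tilde G_{\tilde\lambda}$ then either $\nu = 0$ or $\nu \in \arg\min G_\lambda$ with $\lambda = \tilde\lambda / \|\nu\|_{TV}$.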
In terms of optimization convergence guarantees, when using the reduction by lifting, the problems with $\|\cdot\|_{TV}$ vs. with $\|\cdot\|_{TV}^2$ regularization give rise to similar analyses, as discussed in Rem. 3.1. However, when using the reduction by bilevel optimization, it seems that only the problem with $\|\cdot\|_{TV}^2$ regularization is amenable to a precise analysis. This is perhaps most apparent in our derivation of the simplified expression for the bilevel objective, Prop. D.2.
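The correspondence between the two regularization paths can be checked by hand in a finite-dimensional toy case. The sketch below is only an illustration under assumptions that are not in the paper: the measure is a vector of weights on fixed atoms (so the TV norm is the l1 norm), and G(ν) = ½‖ν − y‖², for which the TV-regularized minimizer is given by soft-thresholding. It then verifies that the same point satisfies the first-order condition of the squared-TV problem with the rescaled parameter (TV-parameter divided by the TV norm of the minimizer).

```python
# Toy check of the regularization-path correspondence (hypothetical setup:
# nu is a weight vector on fixed atoms and G(nu) = 0.5 * ||nu - y||^2).

def l1(x):
    return sum(abs(v) for v in x)

def soft_threshold(y, t):
    """Minimizer of 0.5*||x - y||^2 + t*||x||_1 (coordinate-wise shrinkage)."""
    return [(abs(v) - t) * (1 if v > 0 else -1) if abs(v) > t else 0.0 for v in y]

def squared_tv_objective(x, y, lam):
    """G(x) + (lam/2) * ||x||_1^2."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, y)) + 0.5 * lam * l1(x) ** 2

def stationarity_residual(x, y, lam):
    """Largest violation of the first-order condition of the squared-TV problem."""
    thr = lam * l1(x)
    worst = 0.0
    for xi, yi in zip(x, y):
        if xi != 0.0:
            # on the support: G'(x)_i + lam * ||x||_1 * sign(x_i) = 0
            worst = max(worst, abs(xi - yi + thr * (1 if xi > 0 else -1)))
        else:
            # off the support: |G'(x)_i| <= lam * ||x||_1
            worst = max(worst, max(abs(yi) - thr, 0.0))
    return worst

y = [3.0, -1.5, 0.2, 0.9]
lam_tilde = 1.0
x_star = soft_threshold(y, lam_tilde)   # minimizer of the TV-regularized problem
lam = lam_tilde / l1(x_star)            # matching parameter for the squared-TV problem
residual = stationarity_residual(x_star, y, lam)
```

Since the squared-TV objective is convex, a vanishing `residual` means that `x_star` also minimizes it for the parameter `lam`.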

Section: A.2 Detailed comparison with Takakura and Suzuki [TS24]
In this subsection, we show that the learning dynamics considered by [TS24, Sec. 2, 3] is an instance of a variant of MFLD applied to the bilevel reduction of (1.1). We do this by recalling their setting (in the case of single-task learning for simplicity) in notations that are compatible with ours.
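• For a set of first-layer weights $w_i \in \mathcal{W} := \mathbb{R}^d$ and second-layer weights $a_i \in \mathbb{R}$ (for $1 \le i \le N$), and an activation function $\varphi : \mathbb{R} \to \mathbb{R}$, the associated 2NN is defined as
\[
x \mapsto \frac{1}{N} \sum_{i=1}^N a_i \varphi(w_i^\top x).
\]
• For $\mu \in \mathcal{P}(\mathbb{R} \times \mathcal{W})$, the associated infinite-width 2NN is $x \mapsto \int_{\mathbb{R} \times \mathcal{W}} a \varphi(w^\top x) \, d\mu(a, w)$. Note that in our notation of Sec. 3.1, this also writes $x \mapsto \int_{\mathcal{W}} \varphi(w^\top x) \, d[h\mu](w)$.
• Consider a data distribution $\rho(dx, dy) \in \mathcal{P}(\mathbb{R}^d_x \times \mathbb{R}_y)$. We may define the Hilbert space of predictors $\mathcal{H} = L^2_{\rho_x}(\mathbb{R}^d_x)$, and the "single first-layer neuron predictor" mapping $\phi : \mathcal{W} \to \mathcal{H}$ by $\phi(w)(x) = \varphi(w^\top x)$. The predictor associated to an infinite-width 2NN parametrized by $\mu$ is then $\int_{\mathbb{R} \times \mathcal{W}} a \phi(w) \, d\mu(a, w)$.
• Consider a loss function $\ell(\hat y, y) : \mathbb{R}_y \times \mathbb{R}_y \to \mathbb{R}$, inducing a risk functional over predictors given by $R(h) = \mathbb{E}_{(x,y) \sim \rho}[\ell(h(x), y)]$. We may define the unregularized risk functional over (infinite-width) 2NN weights by
\[
L(\mu) = R\Big( \int_{\mathbb{R} \times \mathcal{W}} a \phi(w) \, d\mu(a, w) \Big) = R\Big( \int_{\mathcal{W}} \phi(w) \, d[h\mu](w) \Big).
\]
Accordingly, let the operator $\Phi : \mathcal{M}(\mathcal{W}) \to \mathcal{H}$ be such that $\Phi\nu = \int_{\mathcal{W}} \phi(w) \, d\nu(w)$, and
\[
G(\nu) = R(\Phi\nu) = R\Big( \int_{\mathcal{W}} \phi(w) \, d\nu(w) \Big).
\]
Then the unregularized risk is $L(\mu) = G(h\mu)$.
• The regularized risk functional considered in [TS24, Sec. 2.1] is
\begin{align*}
\widetilde{F}(\mu) &= R\Big( \int_{\mathbb{R} \times \mathcal{W}} a \phi(w) \, d\mu(a, w) \Big) + \frac{\lambda}{2} \int_{\mathbb{R} \times \mathcal{W}} a^2 \, d\mu(a, w) + \frac{1}{2\sigma^2} \int_{\mathbb{R} \times \mathcal{W}} \|w\|^2 \, d\mu(a, w) \tag{A.1} \\
&= G(h\mu) + \frac{\lambda}{2} \int_{\mathbb{R} \times \mathcal{W}} a^2 \, d\mu(a, w) + \frac{1}{2\sigma^2} \int_{\mathbb{R} \times \mathcal{W}} \|w\|^2 \, d\mu(a, w).
\end{align*}
(More precisely, "$F(f, \eta)$" in their notation corresponds to our $\widetilde{F}\big(\delta_{f(w)}(da) \, \eta(dw)\big)$, their "$\lambda_a$" corresponds to our $\lambda$, and their "$\lambda_w$" corresponds to our $1/\sigma^2$.) Note that, in our notation of Sec. 3.1,
\[
\widetilde{F}(\mu) = F_{\lambda,2}(\mu) + \frac{1}{2\sigma^2} \int_{\mathbb{R} \times \mathcal{W}} \|w\|^2 \, d\mu(a, w).
\]
• The bilevel limiting functional, which is the main object of study of [TS24, Sec. 2.1], is then defined as the mapping $\widetilde{G} : \mathcal{P}(\mathcal{W}) \to \mathbb{R}$ such that $\widetilde{G}(\eta) = \inf_{f : \mathcal{W} \to \mathbb{R}} \widetilde{F}\big(\delta_{f(w)}(dr) \, \eta(dw)\big)$, corresponding precisely to
\[
\widetilde{G}(\eta) = J_\lambda(\eta) + \frac{1}{2\sigma^2} \int_{\mathcal{W}} \|w\|^2 \, d\eta(w)
\]
in our notation of Sec. 3.2 (see the paragraph "Link between the lifted and bilevel reductions"). Interestingly, the convexity of $\widetilde{G}$ is almost immediate with our presentation, as it is expressed as a partial minimization of a convex function, whereas the proof of the convexity of $\widetilde{G}$ in [TS24] is quite involved. They also introduce a functional "$U$" which corresponds precisely to our $J_\lambda(\eta)$, and which is an important auxiliary object in their analysis.
• The learning dynamics studied from Section 2.3 onwards in [TS24] (except for the label noise procedure in Section 5) is precisely MFLD for $\widetilde{G}(\eta)$:
\[
\partial_t \eta_t = \beta^{-1} \Delta \eta_t + \operatorname{div}\big( \eta_t \nabla \widetilde{G}'[\eta_t] \big) = \beta^{-1} \Delta \eta_t + \operatorname{div}\Big( \eta_t \Big( \nabla J_\lambda'[\eta_t] + \frac{1}{\sigma^2} w \Big) \Big) \tag{A.2}
\]
(and their constant "$\lambda$" corresponds to our $\beta^{-1}$).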
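To make the objects in the bullet points above concrete, here is a minimal sketch of the finite-width 2NN and of an empirical analogue of the regularized risk (A.1). Everything specific in it is an assumption made for illustration only: the tanh activation, the squared loss, the use of a finite data sample in place of ρ, and the function names.

```python
import math

def two_layer_nn(x, a, w, phi=math.tanh):
    """Finite-width 2NN: x -> (1/N) * sum_i a_i * phi(w_i . x)."""
    n = len(a)
    return sum(ai * phi(sum(wk * xk for wk, xk in zip(wi, x)))
               for ai, wi in zip(a, w)) / n

def regularized_risk(a, w, data, lam, sigma, phi=math.tanh):
    """Empirical analogue of (A.1) for mu = (1/N) * sum_i delta_{(a_i, w_i)}.

    Assumes the squared loss l(yhat, y) = 0.5 * (yhat - y)^2 (hypothetical choice).
    """
    n = len(a)
    risk = sum(0.5 * (two_layer_nn(x, a, w, phi) - y) ** 2 for x, y in data) / len(data)
    penalty_a = 0.5 * lam * sum(ai ** 2 for ai in a) / n              # (lam/2) * int a^2 dmu
    penalty_w = sum(sum(wk ** 2 for wk in wi) for wi in w) / (2.0 * sigma ** 2 * n)
    return risk + penalty_a + penalty_w
```

The first two terms only depend on the particles through the signed measure hµ and the second moment of a, whereas the last term is the confining penalty on the first-layer weights.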
"MFL + confining" dynamics. The PDE (A.2) can be interpreted as a variant of MFLD for J λ in two ultimately equivalent ways: one is as the MFLD PDE (2.1) with an added "confining" term -1 σ 2 w, which intuitively encourages the noisy particles to remain close to the origin. Another is as Wasserstein gradient flow for the regularized functional
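\[
J_{\lambda,\beta,\sigma} = J_\lambda + \beta^{-1} H + \frac{1}{2\sigma^2} \int_{\mathcal{W}} \|w\|^2 \, d\eta(w) = J_\lambda + \beta^{-1} H\big( \cdot \,\big|\, \beta^{-1/2} \sigma \gamma \big) + \mathrm{cst} \quad \text{where} \quad \beta^{-1/2} \sigma \gamma := N(0, \beta^{-1} \sigma^2 I_d),
\]
whereas MFLD for $J_\lambda$ is the Wasserstein gradient flow for the functional regularized by entropy only,
\[
J_{\lambda,\beta} = J_\lambda + \beta^{-1} H(\cdot \,|\, \tau) = J_\lambda + \beta^{-1} H + \mathrm{cst}.
\]
Unsurprisingly in view of this second interpretation, the distribution $\beta^{-1/2} \sigma \gamma$ plays a similar role in the analysis of convergence of (A.2) [TS24, Lemma 3.5] as the one played by the uniform measure $\tau$ in our paper: the local LSI property of $J_{\lambda,\beta,\sigma}$ (resp. $J_{\lambda,\beta}$) is obtained by applying the Holley-Stroock argument using $\beta^{-1/2} \sigma \gamma$ (resp. $\tau$) as a reference measure.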
Note that the additional confining term $-\frac{1}{\sigma^2} w$ in (A.2) cannot be captured straightforwardly by any additional penalty term on the objective $G$ from (1.1). Indeed, informally, the three terms in (A.1) each have a different homogeneity in the variable $a$. Rather, the confining term in $\sigma$ should be viewed as corresponding to another regularization term added to (1.3), besides the entropy one in $\beta^{-1}$.
In short, while our work considers MFLD, i.e., the Wasserstein gradient flow for $F + \beta^{-1} H$, as the main "algorithmic primitive", the work of [TS24] considers an MFL+confining dynamics, i.e., the Wasserstein gradient flow for
\[
F + \beta^{-1} H\big( \cdot \,\big|\, \beta^{-1/2} \sigma \gamma \big).
\]
Summary of differences. On a technical level, the learning dynamics considered by [TS24] corresponds to a special case of a variant of the MFLD-bilevel we consider from Sec. 3.2 onwards. Namely, they focus on instances of the problem (1.1) where $G$ has a particular form, corresponding to learning with 2NN; and they consider $\mathcal{W} = \mathbb{R}^d$ and use an additional confining term $-\frac{1}{\sigma^2} w$ in the MFLD dynamics, while we consider settings where $\mathcal{W}$ is a compact Riemannian manifold, and no additional confining term is needed.
We also emphasize that, while our work and that of [TS24] cover some similar settings, our focus is quite different. In that work, the key object of interest is the kernel that is learned by MFLD in a 2NN setting ($(x, x') \mapsto \int \varphi(x^\top w) \varphi(x'^\top w) \, d\eta(w)$ in the notation of our second bullet point above). By contrast, our main motivation is a general optimization question: how to use MFLD as an algorithmic primitive for problems of the form (1.1). In particular, we do not assume a particular form for $G$ except in Sec. 5, and we pay special attention to the bounds on the local LSI constants of $J_\lambda$ along the MFLD trajectory, instead of using the global uniform LSI bound (compare Prop. 3.4 and [TS24, Lemma 3.5]).

Section: B Details for Sec. 2 (background about MFLD) B.1 The displacement smoothness property
For MFLD (Eq. (2.1)) to be well-posed, we require that F is L-smooth along Wasserstein geodesics for some L < +∞. More precisely, for any constant-speed Wasserstein geodesic (µ t ) t∈[0,1] ⊂ P 2 (Ω) with W 2 (µ 0 , µ 1 ) = 1, t → F (µ t ) should be L-smooth in the usual sense of continuous optimization. This property ensures that the PDE defining MFLD has a unique solution [Chi22b, App. A], and is also helpful to ensure convergence of explicit time-discretization schemes [SWN23]. The following proposition gives a practical sufficient condition.
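Proposition B.1. Suppose $F : \mathcal{P}_2((\Omega, g)) \to \mathbb{R}$ is twice differentiable in the Wasserstein sense. Let $0 \le L < \infty$. Suppose that $F$ satisfies (P1), i.e.,
\[
\forall \mu \in \mathcal{P}_2(\Omega),\ \forall \omega \in \Omega,\quad \max_{s \in T_\omega \Omega,\ \|s\|_\omega \le 1} \nabla^2 F'[\mu](s, s) \le L
\]
and
\[
\forall \mu, \mu' \in \mathcal{P}_2(\Omega),\ \forall \omega \in \Omega,\quad \|\nabla F'[\mu] - \nabla F'[\mu']\|_\omega \le L \, W_2(\mu, \mu'),
\]
where $\nabla^2$ denotes the Riemannian Hessian. Then $F$ is $2L$-smooth along Wasserstein geodesics.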
The first condition can be stated as F ′ [µ] : Ω → R having Lipschitz-continuous gradients in the Riemannian sense [Bou23,Coroll. 10.47], whereas the second condition can be interpreted as a displacement Lipschitz-continuity of µ → F ′ [µ](ω) for each ω uniformly.
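Proof. Let $(\mu_t)_{t \in [0,1]} \subset \mathcal{P}_2(\Omega)$ be a constant-speed Wasserstein geodesic with $W_2(\mu_0, \mu_1) = 1$, and set $f(t) = F(\mu_t)$. We want to show that $f$ is $2L$-smooth in the usual sense of continuous optimization, for which it suffices to show that $|f''(t)| \le 2L$ for all $t$.
By [Vil09, Eq. (13.6)] there exist functions $\phi_t : \Omega \to \mathbb{R}$ such that
\[
\partial_t \mu_t = -\operatorname{div}(\nabla \phi_t \, \mu_t), \qquad \partial_t \phi_t = -\frac{1}{2} \|\nabla \phi_t\|^2, \qquad \text{and} \qquad \int d\mu_t \, \|\nabla \phi_t\|^2 = W_2^2(\mu_0, \mu_1) = 1 \ \text{for all } t.
\]
So we can compute explicitly:
\begin{align*}
f'(t) &= \frac{d}{dt} F(\mu_t) = \int d\mu_t \, \langle \nabla F'[\mu_t], \nabla \phi_t \rangle \\
f''(t) &= \int d(\partial_t \mu_t) \, \langle \nabla F'[\mu_t], \nabla \phi_t \rangle + \int d\mu_t \, \frac{d}{dt} \langle \nabla F'[\mu_t], \nabla \phi_t \rangle \\
&= \int d\mu_t \, \big\langle \nabla \langle \nabla F'[\mu_t], \nabla \phi_t \rangle, \nabla \phi_t \big\rangle + \int d\mu_t \, \Big[ \Big\langle \nabla F'[\mu_t], \frac{d}{dt} \nabla \phi_t \Big\rangle + \Big\langle \frac{d}{dt} \nabla F'[\mu_t], \nabla \phi_t \Big\rangle \Big] \\
&= \int d\mu_t \, \nabla^2 F'[\mu_t](\nabla \phi_t, \nabla \phi_t) \\
&\quad + \int d\mu_t \, \nabla^2 \phi_t(\nabla F'[\mu_t], \nabla \phi_t) + \int d\mu_t \, \langle \nabla F'[\mu_t], \nabla \partial_t \phi_t \rangle \\
&\quad + \int d\mu_t \, \Big\langle \frac{d}{dt} \nabla F'[\mu_t], \nabla \phi_t \Big\rangle.
\end{align*}
Now the first line of the last expression can be bounded using the first condition of (P1): writing $s_t(\omega) = \frac{\nabla \phi_t(\omega)}{\|\nabla \phi_t(\omega)\|}$ for all $t$ and $\omega$,
\[
\int d\mu_t \, \nabla^2 F'[\mu_t](\nabla \phi_t, \nabla \phi_t) = \int d\mu_t \, \|\nabla \phi_t\|^2 \, \nabla^2 F'[\mu_t](s_t, s_t) \le L \cdot \int d\mu_t \, \|\nabla \phi_t\|^2 = L.
\]
Moreover, one can show by direct computation that the second line is zero, using that
\[
\partial_t \phi_t = -\frac{1}{2} \|\nabla \phi_t\|^2 .
\]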
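For completeness, the direct computation behind this cancellation can be sketched as follows, assuming the potentials are smooth enough to differentiate the Hamilton-Jacobi equation in space. Taking the Riemannian gradient of both sides gives

```latex
\nabla \partial_t \phi_t
  = -\tfrac{1}{2} \nabla \|\nabla \phi_t\|^2
  = -\nabla^2 \phi_t(\nabla \phi_t, \cdot)^{\sharp},
\qquad\text{so that}\qquad
\langle \nabla F'[\mu_t], \nabla \partial_t \phi_t \rangle
  = -\nabla^2 \phi_t(\nabla F'[\mu_t], \nabla \phi_t).
```

Integrating against $\mu_t$, the term in $\nabla \partial_t \phi_t$ is exactly the opposite of the term in $\nabla^2 \phi_t$, by symmetry of the Hessian.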
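For the third line, we have
\[
\int d\mu_t \, \Big\langle \frac{d}{dt} \nabla F'[\mu_t], \nabla \phi_t \Big\rangle \le \int d\mu_t \, \|\nabla \phi_t\| \cdot \sup_{t \in [0,1]} \sup_{\omega \in \Omega} \Big\| \frac{d}{dt} \nabla F'[\mu_t](\omega) \Big\| \le \sup_{t \in [0,1]} \sup_{\omega \in \Omega} \Big\| \frac{d}{dt} \nabla F'[\mu_t](\omega) \Big\|,
\]
since $\big( \int d\mu_t \, \|\nabla \phi_t\| \big)^2 \le \int d\mu_t \, \|\nabla \phi_t\|^2 = 1$.
Finally, let us show that the second condition of (P1) implies a bound on the last quantity: for all $\omega \in \Omega$, by applying the assumption to $\mu = \mu_t$ and $\mu' = \mu_s$,
\[
\frac{\|\nabla F'[\mu_s](\omega) - \nabla F'[\mu_t](\omega)\|_\omega}{|s - t|} \le L \, \frac{W_2(\mu_s, \mu_t)}{|s - t|} = L,
\]
since $(\mu_t)_t$ is a constant-speed geodesic with $W_2(\mu_0, \mu_1) = 1$. So by letting $s \to t$ we obtain that $\big\| \frac{d}{dt} \nabla F'[\mu_t](\omega) \big\| \le L$ for all $t \in [0, 1]$, $\omega \in \Omega$. Thus we have shown $|f''(t)| \le 2L$, and so $F$ is $2L$-smooth along Wasserstein geodesics.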

Section: B.2 Classical sufficient conditions for LSI
For ease of reference we reproduce here a classical sufficient condition for a probability measure µ ∈ P(Ω) to satisfy LSI.

Lemma B.2 (Holley-Stroock bounded perturbation argument [HS86]). Let $\mu, \mu_0 \in \mathcal{P}(\Omega)$ be such that $\mu$ is absolutely continuous w.r.t. $\mu_0$. Suppose that $\mu_0$ satisfies LSI with constant $\alpha$ and that $-M \le \log \frac{d\mu}{d\mu_0}(\omega) + c \le M$ for all $\omega \in \operatorname{supp}(\mu_0)$, for some $c \in \mathbb{R}$ and $M \ge 0$. Then $\mu$ satisfies LSI with constant $\alpha e^{-M}$.

Section: C Details for Sec. 3.1 (reduction by lifting)
C.1 Proof of Prop. 3.1
Here we present a slightly stronger version of Prop. 3.1 that uses the p-homogeneous projection operator for arbitrary p > 0, in preparation for the next subsection, where we show that one can restrict attention to the case p = 1 as done in the main text.
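Recall that we let $\Omega = \mathbb{R} \times \mathcal{W}$. For any $p > 0$, we denote by $h^p : \mathcal{P}(\Omega) \to \mathcal{M}(\mathcal{W})$ the signed $p$-homogeneous projection operator [LMS18] defined by
\[
\forall \varphi \in C(\mathcal{W}, \mathbb{R}), \quad \int_{\mathcal{W}} \varphi(w) \, (h^p \mu)(dw) = \int_{\Omega} \operatorname{sign}(r) |r|^p \varphi(w) \, \mu(dr, dw).
\]
More concretely, for atomic measures,
\[
h^p \Big( \frac{1}{m} \sum_{j=1}^m \delta_{(r_j, w_j)} \Big) = \frac{1}{m} \sum_{j=1}^m \operatorname{sign}(r_j) |r_j|^p \delta_{w_j}.
\]
Lemma C.1. For $b \in [1, 2]$ and $p > 0$, let $\Psi_{b,p} : \mathcal{P}(\Omega) \to \mathbb{R} \cup \{+\infty\}$ be defined by
\[
\Psi_{b,p}(\mu) := \Big( \int_{\Omega} |r|^{pb} \, d\mu(r, w) \Big)^{2/b}
\]
if $\mu \in \mathcal{P}_{pb}(\Omega)$, and $+\infty$ otherwise. Then
\[
\min_{\mu \ \text{s.t.}\ h^p \mu = \nu} \Psi_{b,p}(\mu) = \|\nu\|_{TV}^2 .
\]
Moreover, if $b = 1$ then the set of minimizers is $\{ \mu \in \mathcal{P}(\Omega);\ h^p \mu = \nu \text{ and } \forall w,\ \operatorname{supp}(\mu(\cdot | w)) \subset \mathbb{R}_+ \text{ or } \operatorname{supp}(\mu(\cdot | w)) \subset \mathbb{R}_- \}$, and if $b > 1$ there is a unique minimizer, which is $\delta_{f(w)}(dr) \frac{|\nu|(dw)}{\|\nu\|_{TV}}$ where $f(w) = \|\nu\|_{TV}^{1/p} \frac{d\nu}{d|\nu|}(w)$.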
Proof. For any $\mu \in \mathcal{P}(\Omega)$ such that $h^p \mu = \nu$,
\[
\|h^p \mu\|_{TV} = \max_{\phi : \mathcal{W} \to [-1, 1]} \int_{\Omega} \operatorname{sign}(r) |r|^p \phi(w) \, d\mu(r, w) \le \int_{\Omega} |r|^p \, d\mu(r, w),
\]
so
\[
\|\nu\|_{TV}^2 = \|h^p \mu\|_{TV}^2 \le \Big( \Big( \int_{\Omega} |r|^p \, d\mu(r, w) \Big)^b \Big)^{2/b} \le \Big( \int_{\Omega} |r|^{pb} \, d\mu(r, w) \Big)^{2/b} = \Psi_{b,p}(\mu),
\]
where the first inequality follows from the triangle inequality, and the second inequality follows from Jensen's inequality since $t \mapsto t^b$ is convex on $\mathbb{R}_+$. Note that the first inequality above holds with equality if and only if there exists $\phi : \mathcal{W} \to [-1, 1]$ such that $\operatorname{sign}(r) \phi(w) \ge 0$ for all $(r, w) \in \operatorname{supp}(\mu)$, i.e., if the conditional distribution $\mu(dr | w)$ is either supported on $\mathbb{R}_+$ or supported on $\mathbb{R}_-$ for each $w$. Conversely, the value $\|\nu\|_{TV}^2$ is attained by letting $\mu(dr, dw) = \delta_{f(w)}(dr) \frac{|\nu|(dw)}{\|\nu\|_{TV}}$ where $f(w) = \|\nu\|_{TV}^{1/p} \frac{d\nu}{d|\nu|}(w)$. This proves that $\min_{\mu : h^p \mu = \nu} \Psi_{b,p}(\mu) = \|\nu\|_{TV}^2$.
For $b = 1$, $t \mapsto t^b = t$ is linear, so equality always holds in Jensen's inequality. So the set of minimizers is all of $\{ \mu \in \mathcal{P}(\Omega);\ h^p \mu = \nu \text{ and } \forall w,\ \operatorname{supp}(\mu(\cdot | w)) \subset \mathbb{R}_+ \text{ or } \operatorname{supp}(\mu(\cdot | w)) \subset \mathbb{R}_- \}$.
For $b > 1$, $t \mapsto t^b$ is strictly convex, so the second inequality above holds with equality if and only if there exists a constant $c$ such that $|r|^p = c$ for all $(r, w) \in \operatorname{supp}(\mu)$. So for $\mu$ to be a minimizer, the conditional distribution $\mu(dr | w)$ must be concentrated on $\{ c^{1/p}, -c^{1/p} \}$ for each $w$. Moreover, for the first inequality above to hold with equality, the conditional distribution at each $w$ must be either supported on $\mathbb{R}_+$ or supported on $\mathbb{R}_-$, so there exists a function $f : \mathcal{W} \to \{ c^{1/p}, -c^{1/p} \}$ such that $\mu(dr, dw) = \delta_{f(w)}(dr) \, \mu_w(dw)$ where $\mu_w \in \mathcal{P}(\mathcal{W})$ denotes the marginal distribution. Since $h^p \mu = \nu$, then for all fixed $w$,
\[
\int_{\mathbb{R}} \operatorname{sign}(r) |r|^p \, \mu(dr, dw) = \operatorname{sign}(f(w)) \, c \, \mu_w(dw) = \nu(dw).
\]
So $\operatorname{sign}(f(w)) = \operatorname{sign}\big( \frac{d\nu}{d\mu_w}(w) \big) = \frac{d\nu}{d|\nu|}(w)$ and $\mu_w(dw) = \frac{1}{c} |\nu|(dw)$, since $\mu_w$ is a probability measure and hence non-negative; integrating both sides over $\mathcal{W}$ shows that $c = \|\nu\|_{TV}$. Hence the only minimizer is $\mu(dr, dw) = \delta_{f(w)}(dr) \frac{|\nu|(dw)}{\|\nu\|_{TV}}$ where $f(w) = c^{1/p} \frac{d\nu}{d|\nu|}(w)$.
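The statement of Lemma C.1 is easy to probe numerically on atomic measures. The sketch below is illustrative only: the representation of measures as lists of weighted atoms, the function names, and the toy support on three labelled points are all assumptions. It implements the p-homogeneous projection, checks that the lifted penalty dominates the squared TV norm, and checks that the minimizer described in the lemma (for b > 1) attains it.

```python
import random

def sign(r):
    return (r > 0) - (r < 0)

def h_p(mu, p):
    """Signed p-homogeneous projection of mu = sum_j m_j * delta_{(r_j, w_j)}.

    Returns the signed measure on W as a dict {w: mass}.
    """
    nu = {}
    for m, r, w in mu:
        nu[w] = nu.get(w, 0.0) + m * sign(r) * abs(r) ** p
    return nu

def tv_norm(nu):
    return sum(abs(v) for v in nu.values())

def psi(mu, b, p):
    """Psi_{b,p}(mu) = (integral of |r|^(p*b) dmu)^(2/b)."""
    return sum(m * abs(r) ** (p * b) for m, r, _ in mu) ** (2.0 / b)

def optimal_lift(nu, p):
    """Minimizer from Lemma C.1 (b > 1): one atom per w, with |r|^p = ||nu||_TV."""
    tv = tv_norm(nu)
    return [(abs(v) / tv, sign(v) * tv ** (1.0 / p), w) for w, v in nu.items() if v != 0.0]

rng = random.Random(0)
p, b = 1.5, 2.0
mu = [(1.0 / 12, rng.uniform(-2.0, 2.0), rng.choice("abc")) for _ in range(12)]
nu = h_p(mu, p)
gap = psi(mu, b, p) - tv_norm(nu) ** 2            # non-negative by the lemma
best = psi(optimal_lift(nu, p), b, p)             # should equal ||nu||_TV^2
```

A generic lift has a strictly positive `gap`, both because atoms with opposite signs may share the same location and because the values of |r| are not constant.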
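Prop. 3.1 follows directly as a special case of the following proposition with $p = 1$.
Proposition C.2. Let $p > 0$ and $b \in [1, 2]$, and let $\Psi_{b,p} : \mathcal{P}(\Omega) \to \mathbb{R} \cup \{+\infty\}$ be as in the lemma above. Consider the optimization problem over probability measures, with $\lambda > 0$,
\[
\min_{\mu \in \mathcal{P}(\Omega)} F_{\lambda,b,p}(\mu) \quad \text{where} \quad F_{\lambda,b,p}(\mu) = G(h^p \mu) + \frac{\lambda}{2} \Psi_{b,p}(\mu). \tag{C.1}
\]
Then $\min_{\mathcal{P}(\Omega)} F_{\lambda,b,p} = \min_{\mathcal{M}(\mathcal{W})} G_\lambda$. Moreover, if $b > 1$ then
\[
\arg\min F_{\lambda,b,p} = \Big\{ \delta_{\|\nu\|_{TV}^{1/p} \frac{d\nu}{d|\nu|}(w)}(dr) \, \frac{|\nu|(dw)}{\|\nu\|_{TV}} ;\ \nu \in \arg\min G_\lambda \Big\},
\]
and otherwise
\[
\arg\min F_{\lambda,b,p} = \big\{ \mu;\ h^p \mu \in \arg\min G_\lambda \text{ and } \forall w,\ \operatorname{supp}(\mu(\cdot | w)) \subset \mathbb{R}_+ \text{ or } \operatorname{supp}(\mu(\cdot | w)) \subset \mathbb{R}_- \big\}.
\]
Furthermore, $F_{\lambda,b,p}$ is convex.
Proof. The fact that $\min_{\mathcal{P}(\Omega)} F_{\lambda,b,p} = \min_{\mathcal{M}(\mathcal{W})} G_\lambda$ can be seen directly as follows:
\begin{align*}
\min_{\mu \in \mathcal{P}(\Omega)} F_{\lambda,b,p}(\mu) &= \min_{\mu \in \mathcal{P}(\Omega)} G(h^p \mu) + \frac{\lambda}{2} \Psi_{b,p}(\mu) \\
&= \min_{\nu \in \mathcal{M}(\mathcal{W})} \ \min_{\mu \in \mathcal{P}(\Omega) : h^p \mu = \nu} G(h^p \mu) + \frac{\lambda}{2} \Psi_{b,p}(\mu) \\
&= \min_{\nu \in \mathcal{M}(\mathcal{W})} G(\nu) + \frac{\lambda}{2} \min_{\mu \in \mathcal{P}(\Omega) : h^p \mu = \nu} \Psi_{b,p}(\mu) \\
&= \min_{\nu \in \mathcal{M}(\mathcal{W})} G(\nu) + \frac{\lambda}{2} \|\nu\|_{TV}^2 \\
&= \min_{\nu \in \mathcal{M}(\mathcal{W})} G_\lambda(\nu),
\end{align*}
where we used the lemma above at the fourth equality. The characterization of $\arg\min F_{\lambda,b,p}$ in terms of $\arg\min G_\lambda$ follows from the characterization of the minimizers of the inner minimization $\min_{\mu \in \mathcal{P}(\Omega) : h^p \mu = \nu} \Psi_{b,p}(\mu)$ in the third line, which is given by the lemma above.
Furthermore, $F_{\lambda,b,p}$ is convex since $G$ and $\Psi_{b,p}$ are convex and $h^p$ is linear.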
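C.2 Equivalence of using $(cp, cq_r, cq_w, \Gamma/c^2)$ for any $c > 0$ by reparametrizing
Equivalence of Riemannian structures on $\Omega^*$ for $(cq_r, cq_w, \Gamma/c^2)$ for $c > 0$. Recall that we consider equipping $\Omega^* = \mathbb{R}^* \times \mathcal{W}$ with a Riemannian metric of the form (3.2), reproduced here for ease of reference:
\[
\Big\langle \begin{pmatrix} \delta r_1 \\ \delta w_1 \end{pmatrix}, \begin{pmatrix} \delta r_2 \\ \delta w_2 \end{pmatrix} \Big\rangle_{(r,w)} = \Gamma^{-1} |r|^{q_r} \frac{\delta r_1 \delta r_2}{r^2} + |r|^{q_w} \langle \delta w_1, \delta w_2 \rangle_w, \quad \text{i.e.,} \quad g_{(r,w)} = \begin{pmatrix} \Gamma^{-1} |r|^{q_r - 2} & 0 \\ 0 & |r|^{q_w} g_w \end{pmatrix}.
\]
The following proposition shows that, in fact, different choices of $q_r$, $q_w$ and $\Gamma$ lead to the same geometry, up to a reparametrization of the form $(a, w) = (r^\alpha, w)$ (for $r > 0$). Namely, it is equivalent to use the metric with exponents $(q_r, q_w)$ or with $\big( \frac{q_r}{\alpha}, \frac{q_w}{\alpha} \big)$, up to adjusting $\Gamma$.
Proposition C.3. For any $q_r, q_w$, denote by $g_{[q_r, q_w, \Gamma]}$ the metric $g_{(r,w)} = \begin{pmatrix} \Gamma^{-1} |r|^{q_r - 2} & 0 \\ 0 & |r|^{q_w} g_w \end{pmatrix}$ on $\Omega^* = \mathbb{R}^* \times \mathcal{W}$. Then for any $q_r, q_w \in \mathbb{R}$ and $\Gamma, \alpha > 0$, the map
\[
T_\alpha : \big( \Omega^*, g_{[q_r, q_w, \Gamma]} \big) \to \big( \Omega^*, g_{[\frac{q_r}{\alpha}, \frac{q_w}{\alpha}, \alpha^2 \Gamma]} \big) \quad \text{defined by} \quad T_\alpha(r, w) = (\operatorname{sign}(r) |r|^\alpha, w)
\]
is an isometry.
Proof. Since $\Omega^*$ is a disjoint union of manifolds, $\Omega^* = (\mathbb{R}^*_+ \times \mathcal{W}) \cup (\mathbb{R}^*_- \times \mathcal{W})$, and since $T_\alpha(\mathbb{R}^*_+ \times \mathcal{W}) = \mathbb{R}^*_+ \times \mathcal{W}$, it suffices to check that the restricted map
\[
T^+_\alpha : \big( \mathbb{R}^*_+ \times \mathcal{W}, g_{[q_r, q_w, \Gamma]} \big) \to \big( \mathbb{R}^*_+ \times \mathcal{W}, g_{[\frac{q_r}{\alpha}, \frac{q_w}{\alpha}, \alpha^2 \Gamma]} \big)
\]
is an isometry (as well as the analogous statement for the restricted map $T^-_\alpha$, which follows analogously). Indeed, denote by $\tilde g$ the metric on $\mathbb{R}^*_+ \times \mathcal{W}$ induced by $T^+_\alpha$. It is given as follows: for $(a, w) = T^+_\alpha(r, w) = (r^\alpha, w)$, we have $\frac{da}{a} = \alpha \frac{dr}{r}$, so
\[
\begin{pmatrix} \delta r_1 \\ \delta w_1 \end{pmatrix} \cdot g_{(r,w)} \begin{pmatrix} \delta r_2 \\ \delta w_2 \end{pmatrix} = \begin{pmatrix} \delta a_1 \\ \delta w_1 \end{pmatrix} \cdot \tilde g_{(a,w)} \begin{pmatrix} \delta a_2 \\ \delta w_2 \end{pmatrix} = \begin{pmatrix} \frac{\alpha a}{r} \delta r_1 \\ \delta w_1 \end{pmatrix} \cdot \tilde g_{(a,w)} \begin{pmatrix} \frac{\alpha a}{r} \delta r_2 \\ \delta w_2 \end{pmatrix},
\]
so
\[
\tilde g_{(a,w)} = \begin{pmatrix} \frac{r}{\alpha a} & 0 \\ 0 & 1 \end{pmatrix} g_{(r,w)} \begin{pmatrix} \frac{r}{\alpha a} & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} \frac{r^2}{\alpha^2 a^2} \Gamma^{-1} r^{q_r - 2} & 0 \\ 0 & r^{q_w} g_w \end{pmatrix} = \begin{pmatrix} \Gamma^{-1} \alpha^{-2} a^{q_r/\alpha - 2} & 0 \\ 0 & a^{q_w/\alpha} g_w \end{pmatrix}.
\]
So $\tilde g$ is precisely $g_{[\frac{q_r}{\alpha}, \frac{q_w}{\alpha}, \alpha^2 \Gamma]}$ on $\mathbb{R}^*_+ \times \mathcal{W}$, which proves the claim.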
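The change-of-variables computation in this proof can be double-checked numerically. The following sketch only tracks the two scalar factors of the block-diagonal metric on the half-space of positive radii (the radial entry and the factor multiplying the metric of W); the function names and the numerical values are illustrative assumptions.

```python
import math

def metric(r, q_r, q_w, gamma):
    """(radial entry, factor in front of g_w) of the metric g_[q_r, q_w, Gamma] at r > 0."""
    return (r ** (q_r - 2) / gamma, r ** q_w)

def pushed_metric(a, alpha, q_r, q_w, gamma):
    """Metric induced on the coordinates (a, w) = (r^alpha, w).

    The radial entry is rescaled by (dr/da)^2 = (r / (alpha * a))^2,
    while the w-block is unchanged.
    """
    r = a ** (1.0 / alpha)
    g_rr, g_ww = metric(r, q_r, q_w, gamma)
    dr_da = r / (alpha * a)
    return (dr_da ** 2 * g_rr, g_ww)

alpha, q_r, q_w, gamma = 2.5, 1.0, 3.0, 0.7
checks = []
for a in (0.3, 1.0, 4.2):
    induced = pushed_metric(a, alpha, q_r, q_w, gamma)
    target = metric(a, q_r / alpha, q_w / alpha, alpha ** 2 * gamma)
    checks.append(all(math.isclose(u, v, rel_tol=1e-12) for u, v in zip(induced, target)))
```

Each entry of `checks` compares the induced metric with the metric of parameters (q_r/α, q_w/α, α²Γ) at one radius.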
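Equivalence of the Wasserstein gradient flow of $F_{\lambda,b,p}$ for $(cp, cq_r, cq_w, \Gamma/c^2)$ for any $c > 0$.
Proposition C.4. Let $T : (\Omega_1, g_{[1]}) \to (\Omega_2, g_{[2]})$ be an isometry between Riemannian manifolds. Let $F : \mathcal{P}(\Omega_1) \to \mathbb{R}$ (sufficiently regular) and $(\mu_t)_t$ a Wasserstein gradient flow for $F$, i.e., $\partial_t \mu_t = \operatorname{div}(\mu_t \nabla F'[\mu_t])$ (where $\nabla$ denotes the Riemannian gradient in $(\Omega_1, g_{[1]})$). Then, $(\tilde\mu_t)_t := (T_\sharp \mu_t)_t$ is a Wasserstein gradient flow for $\widetilde{F} : \mathcal{P}(\Omega_2) \to \mathbb{R}$ defined by $\widetilde{F}(\tilde\mu) = F(T^{-1}_\sharp \tilde\mu)$.
Proof. First note that $g_{[2]}$ is given as follows: for all $y = T(x) \in \Omega_2$, we have $dy = DT(x) \, dx$ where $D$ denotes the differential, so
\[
\delta y^\top g_{[2]y} \, \delta y' = \delta x^\top g_{[1]x} \, \delta x' = \delta y^\top \big( (DT(x))^{-1} \big)^\top g_{[1]x} (DT(x))^{-1} \delta y', \quad \text{so} \quad g^{-1}_{[1]x} = (DT(x))^{-1} \, g^{-1}_{[2]T(x)} \, \big( (DT(x))^{-1} \big)^\top .
\]
Also note that $\widetilde{F}'[\tilde\mu](y) = F'[T^{-1}_\sharp \tilde\mu](T^{-1}(y))$, as one can check directly by computing
\[
\lim_{\varepsilon \to 0} \frac{1}{\varepsilon} \big[ \widetilde{F}(\tilde\mu + \varepsilon \tilde\nu) - \widetilde{F}(\tilde\mu) \big] = \lim_{\varepsilon \to 0} \frac{1}{\varepsilon} \big[ F(T^{-1}_\sharp \tilde\mu + \varepsilon T^{-1}_\sharp \tilde\nu) - F(T^{-1}_\sharp \tilde\mu) \big].
\]
In particular $D\widetilde{F}'[\tilde\mu](y) = DF'[T^{-1}_\sharp \tilde\mu](T^{-1}(y)) \, (DT(T^{-1}(y)))^{-1}$. Then for any $\varphi : \Omega_2 \to \mathbb{R}$,
\begin{align*}
\frac{d}{dt} \int_{\Omega_2} \varphi \, d\tilde\mu_t = \frac{d}{dt} \int_{\Omega_1} \varphi(T(x)) \, d\mu_t(x) &= -\int_{\Omega_1} D\varphi(T(x)) \, DT(x) \, g^{-1}_{[1]x} \, DF'[\mu_t](x)^\top \, d\mu_t(x) \\
&= -\int_{\Omega_2} D\varphi(y) \, g^{-1}_{[2]y} \, D\widetilde{F}'[\tilde\mu_t](y)^\top \, d\tilde\mu_t(y).
\end{align*}
That is, $\partial_t \tilde\mu_t = \operatorname{div}\big( \tilde\mu_t \, g^{-1}_{[2]} D\widetilde{F}'[\tilde\mu_t]^\top \big)$, i.e., $(\tilde\mu_t)_t$ is a Wasserstein gradient flow for $\widetilde{F}$.
Proposition C.5. Let $\Omega^* = \mathbb{R}^* \times \mathcal{W}$. Fix $q_r, q_w \in \mathbb{R}$, $\Gamma, p, \lambda > 0$ and $b \in [1, 2]$. Let $(\mu_t)_t$ be the Wasserstein gradient flow for $F_{\lambda,b,p}$ over $(\Omega^*, g_{[q_r, q_w, \Gamma]})$, starting from some $\mu_0 \in \mathcal{P}(\Omega^*)$.
Let $\alpha > 0$ and $T_\alpha : \Omega^* \to \Omega^*$ defined by $T_\alpha(r, w) = (\operatorname{sign}(r) |r|^\alpha, w)$. Then $(\tilde\mu_t)_t := ((T_\alpha)_\sharp \mu_t)_t$ coincides with the Wasserstein gradient flow for $F_{\tilde\lambda, \tilde b, \tilde p}$ over $(\Omega^*, g_{[\tilde q_r, \tilde q_w, \tilde\Gamma]})$ starting from $\tilde\mu_0 = (T_\alpha)_\sharp \mu_0$, where
\[
\tilde p = \frac{p}{\alpha}, \quad \tilde q_r = \frac{q_r}{\alpha}, \quad \tilde q_w = \frac{q_w}{\alpha}, \quad \tilde\Gamma = \alpha^2 \Gamma, \quad \tilde\lambda = \lambda, \quad \tilde b = b.
\]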
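Proof. The proposition follows from an application of Prop. C.4 with $T = T_\alpha$, $\Omega_1 = (\Omega^*, g_{[q_r, q_w, \Gamma]})$, $\Omega_2 = (\Omega^*, g_{[\tilde q_r, \tilde q_w, \tilde\Gamma]})$ and $F = F_{\lambda,b,p}$. Indeed the fact that $T_\alpha$ is an isometry from $\Omega_1$ to $\Omega_2$ was shown in Prop. C.3. It only remains to show that $F \circ T^{-1}_\sharp = F_{\tilde\lambda, \tilde b, \tilde p}$. And indeed for any $\tilde\mu \in \mathcal{P}(\Omega^*)$,
\[
F_{\lambda,b,p}\big( (T_\alpha)^{-1}_\sharp \tilde\mu \big) = F_{\lambda,b,p}\big( (T_{\alpha^{-1}})_\sharp \tilde\mu \big) = G\big( h^p (T_{\alpha^{-1}})_\sharp \tilde\mu \big) + \frac{\lambda}{2} \Psi_{b,p}\big( (T_{\alpha^{-1}})_\sharp \tilde\mu \big),
\]
and $h^p (T_{\alpha^{-1}})_\sharp \tilde\mu = h^{p/\alpha} \tilde\mu$, since for any $\varphi : \mathcal{W} \to \mathbb{R}$,
\begin{align*}
\int_{\mathcal{W}} \varphi \, d\big[ h^p (T_{\alpha^{-1}})_\sharp \tilde\mu \big] &= \int_{\mathbb{R}} \int_{\mathcal{W}} \varphi(w) \operatorname{sign}(r) |r|^p \, \big[ (T_{\alpha^{-1}})_\sharp \tilde\mu \big](dr, dw) \\
&= \int_{\mathbb{R}} \int_{\mathcal{W}} \varphi(w) \operatorname{sign}(r) |r|^{p/\alpha} \, \tilde\mu(dr, dw) = \int_{\mathcal{W}} \varphi \, d\big[ h^{p/\alpha} \tilde\mu \big],
\end{align*}
and
\[
\Psi_{b,p}\big( (T_{\alpha^{-1}})_\sharp \tilde\mu \big) = \Big( \int |r|^{pb} \, d\big[ (T_{\alpha^{-1}})_\sharp \tilde\mu \big] \Big)^{2/b} = \Big( \int |r|^{pb/\alpha} \, d\tilde\mu(r, w) \Big)^{2/b} = \Psi_{\tilde b, \tilde p}(\tilde\mu).
\]
This confirms that $F \circ T^{-1}_\sharp = F_{\tilde\lambda, \tilde b, \tilde p}$ and concludes the proof.
Thus, it is equivalent to consider the lifting reduction with the hyperparameters $(p, q_r, q_w, \Gamma)$ or with $(cp, cq_r, cq_w, \Gamma/c^2)$ for any $c > 0$ (take $\alpha = 1/c$).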
Remark C.1. The choice $p = q_r = q_w$ plays a special role, as Wasserstein gradient flows $(\mu_t)_t$ on $\mathcal{P}(\mathbb{R}^*_+ \times \mathcal{W})$ for functionals of the form $\mu \mapsto G(h^p \mu)$ then correspond to gradient flows $(\nu_t)_t$ on $\mathcal{M}_+(\mathcal{W})$ for $G$ in the Wasserstein-Fisher-Rao geometry [Chi22c, Prop. 2.1], via $\nu_t = h^p \mu_t$. This correspondence is lost however for functionals of the form of $F_{\lambda,b,p}$ as in Prop. C.2 with $\lambda \ne 0$.
Equivalence of MFLD of $F_{\lambda,b,p}$ for $(cp, cq_r, cq_w, \Gamma/c^2)$ for any $c > 0$. Since MFLD for $F_{\lambda,b,p}$ is the Wasserstein gradient flow of $F_{\lambda,b,p} + \beta^{-1} H$, then by Prop. C.4, by proceeding similarly as in the proof of Prop. C.5, it suffices to check that $\tilde\mu \mapsto H((T_{\alpha^{-1}})_\sharp \tilde\mu)$ is equal to $H$ itself, up to an additive constant. And indeed, since $T_{\alpha^{-1}}$ is invertible, by the data processing inequality for differential entropy, we have $H((T_{\alpha^{-1}})_\sharp \tilde\mu) = H(\tilde\mu)$ for all $\tilde\mu \in \mathcal{P}(\Omega^*)$.
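C.3 Proof of Prop. 3.2
Lemma C.6. Let $F_{\lambda,b,p}$ be defined as in (C.1) and $\Omega = \mathbb{R} \times \mathcal{W}$. For any $\mu \in \mathcal{P}(\Omega)$,
\[
F'_{\lambda,b,p}[\mu](r, w) = \operatorname{sign}(r) |r|^p \, G'[h^p \mu](w) + \lambda' |r|^{pb} \tag{C.2}
\]
where
\[
\lambda' = \lambda \, \frac{1}{b} \, \Psi_{b,p}(\mu)^{1 - \frac{b}{2}} .
\]
Proof. For any $\mu' \in \mathcal{P}(\Omega)$,
\begin{align*}
\lim_{\varepsilon \to 0} \frac{1}{\varepsilon} \big[ (G \circ h^p)(\mu + \varepsilon \mu') - (G \circ h^p)(\mu) \big] &= \lim_{\varepsilon \to 0} \frac{1}{\varepsilon} \big[ G(h^p \mu + \varepsilon h^p \mu') - G(h^p \mu) \big] \\
&= \int_{\mathcal{W}} G'[h^p \mu](w) \, d[h^p \mu'](w) = \int_{\mathbb{R} \times \mathcal{W}} \operatorname{sign}(r) |r|^p \, G'[h^p \mu](w) \, d\mu'(r, w)
\end{align*}
and so $(G \circ h^p)'[\mu](r, w) = \operatorname{sign}(r) |r|^p \, G'[h^p \mu](w)$. Moreover
\begin{align*}
\Psi_{b,p}(\mu) &= \Big( \int_{\Omega} |r|^{pb} \, d\mu(r, w) \Big)^{\frac{2}{b}} \\
\Psi'_{b,p}[\mu](r, w) &= \frac{2}{b} \Big( \int_{\Omega} |r'|^{pb} \, d\mu(r', w') \Big)^{\frac{2}{b} - 1} |r|^{pb} = \frac{2}{b} \Psi_{b,p}(\mu)^{1 - \frac{b}{2}} |r|^{pb} .
\end{align*}
Summing the results of these two calculations gives the first variation of $F_{\lambda,b,p} = G \circ h^p + \frac{\lambda}{2} \Psi_{b,p}$.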
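The formula for the first variation of the penalty term can be sanity-checked by finite differences on atomic measures. The sketch below is an illustration under assumptions of our own choosing (measures represented as lists of weighted radii, since the penalty does not depend on w; hypothetical function names). It uses the fact that perturbing a probability measure µ towards a Dirac mass at r changes the functional at rate Ψ'[µ](r) minus the µ-average of Ψ'[µ].

```python
def psi(mu, b, p):
    """Psi_{b,p}(mu) for mu = sum_j m_j * delta_{r_j}, given as a list of (m_j, r_j)."""
    return sum(m * abs(r) ** (p * b) for m, r in mu) ** (2.0 / b)

def psi_first_variation(mu, b, p, r):
    """Closed form from the proof of Lemma C.6: (2/b) * Psi^(1 - b/2) * |r|^(p*b)."""
    return (2.0 / b) * psi(mu, b, p) ** (1.0 - b / 2.0) * abs(r) ** (p * b)

def directional_derivative(mu, b, p, r, eps=1e-6):
    """Finite-difference derivative of Psi along (1 - eps) * mu + eps * delta_r."""
    mixed = [((1.0 - eps) * m, rj) for m, rj in mu] + [(eps, r)]
    return (psi(mixed, b, p) - psi(mu, b, p)) / eps

mu = [(0.2, -1.3), (0.5, 0.4), (0.3, 2.1)]
b, p, r_test = 1.5, 1.2, -0.8
numeric = directional_derivative(mu, b, p, r_test)
mean_fv = sum(m * psi_first_variation(mu, b, p, rj) for m, rj in mu)
analytic = psi_first_variation(mu, b, p, r_test) - mean_fv
```

Here `numeric` and `analytic` should agree up to the finite-difference error, which is of the order of `eps`.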
Lemma C.7. Let $f : \mathbb{R}^*_+ \times \mathcal{W} \to \mathbb{R}$ be defined by $f(r, w) = r^p \varphi(w) + \lambda' r^{pb}$, for some $p, \lambda' > 0$, $b \in [1, 2]$, and $\varphi : \mathcal{W} \to \mathbb{R}$. Assume that $\nabla^2 \varphi$ is not constant equal to 0. Consider $\mathbb{R}^*_+ \times \mathcal{W}$ equipped with the Riemannian metric (3.2). If $f$ has Lipschitz-continuous Riemannian gradients, then necessarily $b = 1$ and $p = q_r = q_w$, or $b = 1$ and $p = q_r/2 = q_w/2$ and $\nabla^2 \varphi(w) = \Gamma p^2 \big( \varphi(w) + \lambda' \big) g_w$ for all $w$.
The proof of Lem. C.7 is technical, so it is deferred to the next section.
Proof of Prop. 3.2. Let us prove the first item in the proposition. Suppose, arguing by contraposition, that $F_{\lambda,b}$ does satisfy (P1). Let $\nu \in \mathcal{M}(\mathcal{W})$ be such that $\nabla^2 G'[\nu]$ is not constant equal to 0, and consider some $\mu \in \mathcal{P}(\Omega)$, to be chosen, such that $h\mu = \nu$. Then by the first condition of (P1), $f := F'_{\lambda,b}[\mu] \big|_{\mathbb{R}^*_+ \times \mathcal{W}}$, the restriction of $F'_{\lambda,b}[\mu]$ to $\mathbb{R}^*_+ \times \mathcal{W}$, must have Lipschitz-continuous Riemannian gradients. More explicitly, by (C.2), $f(r, w) = r G'[\nu](w) + \lambda'_\mu r^b$ where $\lambda'_\mu = \frac{\lambda}{b} \Psi_b(\mu)^{1 - \frac{b}{2}}$. So by Lem. C.7, necessarily $b = 1$, and so $\lambda'_\mu = \lambda \Psi_1(\mu)^{1/2}$. If $\varphi := G'[\nu]$ satisfies $\nabla^2 \varphi(w) = \Gamma p^2 \big( \varphi(w) + \lambda'_\mu \big) g_w$ for all $w$, pick any other $\mu'$ such that $h\mu' = \nu$ and $\Psi_1(\mu') \ne \Psi_1(\mu)$; the existence of such a $\mu'$ follows from the first step in the proof of Lem. C.1. Then by applying the above reasoning to $F'_{\lambda,b}[\mu'] \big|_{\mathbb{R}^*_+ \times \mathcal{W}}$ instead of $f$, since $\lambda'_{\mu'} \ne \lambda'_\mu$, we also have by Lem. C.7 that $p = q_r = q_w$. This shows that if $F_{\lambda,b}$ satisfies (P1) then $(q_r, q_w, b) = (1, 1, 1)$, which was the announced necessary condition.
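We now turn to the second item of the proposition. Suppose that $q_r = q_w = b = 1$. For any $\mu \in \mathcal{P}_1(\Omega)$, denote
\[
\lambda_{0\mu} = \frac{\sup_{w \in \mathcal{W}} |G'[h\mu](w)|}{\Psi_1(\mu)^{1/2}} .
\]
Let us show that if $\lambda < \lambda_{0\mu}$, then $F_{\lambda,1}$ does not satisfy local LSI at $\mu$. Suppose that $\lambda < \lambda_{0\mu}$, i.e., there exists $w_0 \in \mathcal{W}$ such that
\[
\Psi_1(\mu)^{1/2} \lambda < |G'[h\mu](w_0)| .
\]
Let us distinguish cases between $G'[h\mu](w_0) \ge 0$ or $G'[h\mu](w_0) < 0$. First suppose $G'[h\mu](w_0) \ge 0$, so that $\Psi_1(\mu)^{1/2} \lambda < G'[h\mu](w_0)$. By continuity of $G'[h\mu]$, let $N \subset \mathcal{W}$ be an open neighborhood of $w_0$ such that $\Psi_1(\mu)^{1/2} \lambda < G'[h\mu](w)$ for all $w \in N$. Then, since $F'_{\lambda,1}[\mu](r, w) = |r| \big( \operatorname{sign}(r) G'[h\mu](w) + \lambda \Psi_1(\mu)^{1/2} \big)$ by (C.2),
\[
\forall r \in \mathbb{R}_-,\ \forall w \in N, \quad F'_{\lambda,1}[\mu](r, w) = |r| \big( -G'[h\mu](w) + \lambda \Psi_1(\mu)^{1/2} \big) \le 0
\]
and so
\[
\int_{\mathbb{R}} \int_{\mathcal{W}} e^{-\beta F'_{\lambda,1}[\mu](r, w)} \, dr \, dw \ge \int_{\mathbb{R}_-} \int_N e^{-\beta F'_{\lambda,1}[\mu](r, w)} \, dr \, dw \ge \int_{\mathbb{R}_-} \int_N 1 \, dr \, dw = +\infty .
\]
This contradicts the exponential integrability condition in the definition of local LSI, and so $F_{\lambda,1}$ does not satisfy local LSI at $\mu$.
Likewise, now suppose that $G'[h\mu](w_0) < 0$, so that $\Psi_1(\mu)^{1/2} \lambda < -G'[h\mu](w_0)$. By continuity of $G'[h\mu]$, let $N \subset \mathcal{W}$ be an open neighborhood of $w_0$ such that $\Psi_1(\mu)^{1/2} \lambda < -G'[h\mu](w)$ for all $w \in N$. Then
\[
\forall r \in \mathbb{R}_+,\ \forall w \in N, \quad F'_{\lambda,1}[\mu](r, w) = |r| \big( G'[h\mu](w) + \lambda \Psi_1(\mu)^{1/2} \big) \le 0
\]
and so
\[
\int_{\mathbb{R}} \int_{\mathcal{W}} e^{-\beta F'_{\lambda,1}[\mu](r, w)} \, dr \, dw \ge \int_{\mathbb{R}_+} \int_N e^{-\beta F'_{\lambda,1}[\mu](r, w)} \, dr \, dw \ge \int_{\mathbb{R}_+} \int_N 1 \, dr \, dw = +\infty .
\]
As in the previous case, we conclude that $F_{\lambda,1}$ does not satisfy local LSI at $\mu$.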

Section: C.4 Proof of Lem. C.7 via computing the Hessians under the lifted Riemannian geometry
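We start with a general lemma. We use $D$ to denote differentials, and for a function $f : \mathbb{R}^*_+ \times \mathcal{W} \to \mathbb{R}$, we will write $D_r f = \frac{\partial f(r, w)}{\partial r}$ and $D_w f = \frac{\partial f(r, w)}{\partial w}$.
Lemma C.8. Let $(\mathcal{W}, g)$ be a Riemannian manifold. Let $\Omega^*_+ = \mathbb{R}^*_+ \times \mathcal{W}$ and consider
\[
\bar g_{(r,w)} = \begin{pmatrix} \alpha(r)^{-1} & 0 \\ 0 & \beta(r)^{-1} g_w \end{pmatrix}
\]
for smooth positive functions $\alpha, \beta : \mathbb{R}^*_+ \to \mathbb{R}^*_+$. This defines a smooth Riemannian metric $\bar g$ on $\Omega^*_+$. Denote by $\bar g_{(r,w)}, \bar\nabla, \bar\Gamma, \bar\nabla^2$ the Riemannian metric, gradient, Christoffel symbols, resp. Hessian on $\Omega^*_+$, and by $g_w, \nabla, \Gamma, \nabla^2$ the corresponding objects on the original space $\mathcal{W}$. Let $f : \Omega^*_+ \to \mathbb{R}$ be a smooth scalar field. Write for convenience $f_r(w) = f(r, w)$, so that for example $\nabla f_r(w) = g_w^{-1} D_w f(r, w)$, and note that $D_r \nabla f_r(w) = \nabla D_r f_r(w)$. Fix a local coordinate chart on $\mathcal{W}$. This induces a local coordinate chart on $\Omega^*_+$ by adding the index 0 for the variable $r$. Then the Riemannian Hessian of $f$ at $(r, w)$ is given in coordinates by
\begin{align*}
\bar\nabla^2 f^{00} &= \alpha(r)^2 D^2_{rr} f + \frac{1}{2} \alpha(r) \alpha'(r) D_r f \\
\bar\nabla^2 f^{i0} = \bar\nabla^2 f^{0i} &= \alpha(r) \beta(r) \, \nabla D_r f_r(w)^i + \frac{1}{2} \alpha(r) \beta'(r) \, \nabla f_r(w)^i \\
\bar\nabla^2 f^{ij} &= \beta(r)^2 \, \nabla^2 f_r(w)^{ij} - \frac{1}{2} \alpha(r) \beta'(r) \cdot D_r f \cdot (g_w^{-1})^{ij} .
\end{align*}
Proof. We will use uppercase letters for indices ranging over $[0, d]$ and lowercase for $[1, d]$, with the index 0 corresponding to the variable $r$; for example $\bar\nabla f(r, w)^0 = \alpha(r) D_r f(r, w)$. We will use Einstein summation notation freely. With a slight abuse of notation we denote $(g^{ij})_{ij} = g^{-1}$ for the inverse matrix of the metric $(g_{ij})_{ij} = g$, and likewise for $\bar g^{IJ}, \bar g_{IJ}$, so that for example $\bar g^{00} = \alpha(r)$.
We start by using that [Lee18, Example 4.22, Eq. (5.10)]
\[
\bar\nabla^2 f(r, w)^{IJ} = \bar g^{IK} \bar g^{JL} \Big( \frac{\partial^2 f}{\partial \omega^K \partial \omega^L} - \bar\Gamma^M_{KL} \frac{\partial f}{\partial \omega^M} \Big) \quad \text{and} \quad \bar\Gamma^M_{IJ} = \frac{1}{2} \bar g^{MK} \Big( \frac{\partial \bar g_{KI}}{\partial \omega^J} + \frac{\partial \bar g_{KJ}}{\partial \omega^I} - \frac{\partial \bar g_{IJ}}{\partial \omega^K} \Big)
\]
where $\omega = (r, w)$, and that the analogous formulas hold for $f_r : \mathcal{W} \to \mathbb{R}$ for all $r$ and for $\Gamma^m_{ij}$ the Christoffel symbols of $\mathcal{W}$.
By direct computations, we find that for all $i, j, m \in [1, d]$,
\begin{align*}
\bar\Gamma^0_{00} &= -\frac{1}{2} \frac{\alpha'(r)}{\alpha(r)} & \bar\Gamma^0_{i0} = \bar\Gamma^0_{0i} &= 0 & \bar\Gamma^0_{ij} &= \frac{1}{2} \alpha(r) \frac{\beta'(r)}{\beta(r)^2} g_{ij} \\
\bar\Gamma^m_{00} &= 0 & \bar\Gamma^m_{i0} = \bar\Gamma^m_{0i} &= -\frac{1}{2} \frac{\beta'(r)}{\beta(r)} \delta^m_i & \bar\Gamma^m_{ij} &= \Gamma^m_{ij} .
\end{align*}
So by direct computations, we find that
\begin{align*}
\bar\nabla^2 f^{00} &= \alpha(r)^2 D^2_{rr} f + \frac{1}{2} \alpha(r) \alpha'(r) D_r f \\
\bar\nabla^2 f^{i0} = \bar\nabla^2 f^{0i} &= \alpha(r) \beta(r) \, \nabla D_r f_r(w)^i + \frac{1}{2} \alpha(r) \beta'(r) \, \nabla f_r(w)^i \\
\bar\nabla^2 f^{ij} &= \beta(r)^2 \, \nabla^2 f_r(w)^{ij} - \frac{1}{2} \alpha(r) \beta'(r) \cdot D_r f \cdot g^{ij},
\end{align*}
as announced.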
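Corollary C.9. Let $f : \Omega^*_+ = \mathbb{R}^*_+ \times \mathcal{W} \to \mathbb{R}$ be defined by $f(r, w) = r^p \varphi(w) + \lambda' r^{pb}$, for some $p > 0$, $b \in [1, 2]$, $\lambda' \ge 0$ and $\varphi : \mathcal{W} \to \mathbb{R}$. Consider $\Omega^*_+$ equipped with the Riemannian metric (3.2). Then the Riemannian Hessian $\bar\nabla^2 f$ of $f$ on $\Omega^*_+$ is given in coordinates by
\begin{align*}
\bar\nabla^2 f^{00} &= \Gamma^2 p (p - q_r/2) \, r^{2 - 2q_r + p} \varphi(w) + \Gamma^2 p b \lambda' (pb - q_r/2) \, r^{2 - 2q_r + pb} \\
\bar\nabla^2 f^{i0} = \bar\nabla^2 f^{0i} &= \Gamma (p - q_w/2) \, r^{1 - q_r - q_w + p} \, \nabla \varphi(w)^i \\
\bar\nabla^2 f^{ij} &= r^{p - 2q_w} \, \nabla^2 \varphi(w)^{ij} + \frac{1}{2} \Gamma q_w r^{-q_r - q_w} \cdot \big( p r^p \varphi(w) + p b \lambda' r^{pb} \big) (g_w^{-1})^{ij},
\end{align*}
where $\nabla$ and $\nabla^2$ denote the Riemannian gradient and Hessian on $\mathcal{W}$.
Proof. Continuing with the same notations as in the proof of the lemma above, we have
\begin{align*}
D_r f &= p r^{p-1} \varphi(w) + p b \lambda' r^{pb - 1} & D^2_{rr} f &= p(p-1) r^{p-2} \varphi(w) + pb(pb - 1) \lambda' r^{pb - 2} \\
\nabla f_r(w)^i &= r^p \, \nabla \varphi(w)^i & \nabla^2 f_r(w)^{ij} &= r^p \, \nabla^2 \varphi(w)^{ij} \\
\nabla D_r f_r(w)^i &= p r^{p-1} \, \nabla \varphi(w)^i
\end{align*}
and so
\begin{align*}
\bar\nabla^2 f^{00} &= \alpha(r)^2 \big[ p(p-1) r^{p-2} \varphi(w) + pb(pb-1) \lambda' r^{pb-2} \big] + \frac{1}{2} \alpha(r) \alpha'(r) \big[ p r^{p-1} \varphi(w) + p b \lambda' r^{pb-1} \big] \\
&= \alpha(r) p \Big[ \alpha(r)(p - 1) + \frac{1}{2} r \alpha'(r) \Big] r^{p-2} \varphi(w) + \alpha(r) p b \lambda' \Big[ \alpha(r)(pb - 1) + \frac{1}{2} r \alpha'(r) \Big] r^{pb - 2} \\
\bar\nabla^2 f^{i0} = \bar\nabla^2 f^{0i} &= \alpha(r) \beta(r) \cdot p r^{p-1} \, \nabla \varphi(w)^i + \frac{1}{2} \alpha(r) \beta'(r) \cdot r^p \, \nabla \varphi(w)^i \\
&= \alpha(r) \Big[ \beta(r) p + \frac{1}{2} r \beta'(r) \Big] r^{p-1} \, \nabla \varphi(w)^i \\
\bar\nabla^2 f^{ij} &= \beta(r)^2 \cdot r^p \, \nabla^2 \varphi(w)^{ij} - \frac{1}{2} \alpha(r) \beta'(r) \cdot \big( p r^{p-1} \varphi(w) + p b \lambda' r^{pb-1} \big) \cdot g^{ij} .
\end{align*}
By substituting $\alpha(r)^{-1} = \Gamma^{-1} r^{q_r - 2}$ and $\beta(r)^{-1} = r^{q_w}$, i.e., $\alpha(r) = \Gamma r^{2 - q_r}$ and $\beta(r) = r^{-q_w}$, we obtain the announced formulas.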
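The final substitution is a purely algebraic step that can be verified numerically. The sketch below evaluates the r-dependent coefficients of the three Hessian blocks in two ways: from the general formulas of Lemma C.8 with α(r) = Γ r^(2−q_r) and β(r) = r^(−q_w), and from the closed forms of Corollary C.9. The function names and the numerical values are arbitrary illustrative choices; the value of φ(w) is treated as a plain number.

```python
import math

def lemma_coeffs(r, phi, p, b, lam_p, q_r, q_w, gamma):
    """Coefficients obtained from Lemma C.8 for f = r^p * phi + lam_p * r^(p*b).

    Returns (H00, c_i0, c_ij): the 00 entry, the factor multiplying grad(phi)^i
    in the i0 entry, and the factor multiplying (g_w^{-1})^{ij} in the ij entry.
    """
    alpha = gamma * r ** (2 - q_r)
    d_alpha = gamma * (2 - q_r) * r ** (1 - q_r)
    beta = r ** (-q_w)
    d_beta = -q_w * r ** (-q_w - 1)
    df = p * r ** (p - 1) * phi + p * b * lam_p * r ** (p * b - 1)
    d2f = p * (p - 1) * r ** (p - 2) * phi + p * b * (p * b - 1) * lam_p * r ** (p * b - 2)
    h00 = alpha ** 2 * d2f + 0.5 * alpha * d_alpha * df
    c_i0 = alpha * beta * p * r ** (p - 1) + 0.5 * alpha * d_beta * r ** p
    c_ij = -0.5 * alpha * d_beta * df
    return h00, c_i0, c_ij

def corollary_coeffs(r, phi, p, b, lam_p, q_r, q_w, gamma):
    """Same coefficients, read off the closed forms of Corollary C.9."""
    h00 = (gamma ** 2 * p * (p - q_r / 2) * r ** (2 - 2 * q_r + p) * phi
           + gamma ** 2 * p * b * lam_p * (p * b - q_r / 2) * r ** (2 - 2 * q_r + p * b))
    c_i0 = gamma * (p - q_w / 2) * r ** (1 - q_r - q_w + p)
    c_ij = 0.5 * gamma * q_w * r ** (-q_r - q_w) * (p * r ** p * phi + p * b * lam_p * r ** (p * b))
    return h00, c_i0, c_ij

params = dict(r=1.7, phi=-0.8, p=1.3, b=1.5, lam_p=0.4, q_r=0.9, q_w=2.2, gamma=3.0)
from_lemma = lemma_coeffs(**params)
from_corollary = corollary_coeffs(**params)
agree = all(math.isclose(u, v, rel_tol=1e-9, abs_tol=1e-9)
            for u, v in zip(from_lemma, from_corollary))
```

The coefficient of the W-Hessian of φ in the ij block is simply β(r)² r^p = r^(p−2q_w) and is not included in the comparison.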
Proof of Lem. C.7. Continuing with the same notations as in the proofs of the lemma and of the corollary above (in particular, $\bar g$ and $\bar\nabla^2$ denote the metric and Hessian on $\Omega^*_+$, and $g$, $\nabla$, $\nabla^2$ the corresponding objects on $\mathcal{W}$), note that $f : \Omega^*_+ = \mathbb{R}^*_+ \times \mathcal{W} \to \mathbb{R}$ having Lipschitz-continuous gradients in the Riemannian sense is equivalent to [Bou23, Coroll. 10.47]
\[
\sup_{\omega \in \Omega^*_+} \ \sup_{s \in T_\omega \Omega^*_+,\ \|s\|_\omega = 1} \big\| \bar\nabla^2 f(\omega)^{IJ} \, \bar g_{JK} \, s^K \big\|_\omega < \infty .
\]
Rewriting everything in coordinates, this means that the matrix
\[
H(\omega) = \Big( \sqrt{\bar g}_{IK} \, \bar\nabla^2 f(\omega)^{IJ} \, \sqrt{\bar g}_{JL} \Big)_{KL} \in \mathbb{R}^{(d+1) \times (d+1)}
\]
must be bounded, uniformly in $\omega \in \Omega^*_+$, where $(\sqrt{\bar g}_{IJ})_{IJ} = \sqrt{\bar g}$ denotes the square root of the positive-definite matrix $\bar g$ (pointwise for each $\omega$). Concretely, for all $i, j \in [1, d]$,
\[
\sqrt{\bar g}_{00} = \alpha(r)^{-1/2} = \Gamma^{-1/2} r^{q_r/2 - 1}, \qquad \sqrt{\bar g}_{i0} = 0, \qquad \sqrt{\bar g}_{ij} = \beta(r)^{-1/2} \sqrt{g}_{ij} = r^{q_w/2} \sqrt{g}_{ij}
\]
and
\begin{align*}
H(\omega)_{00} &= \bar g_{00} \, \bar\nabla^2 f^{00} = \Gamma p (p - q_r/2) \, r^{-q_r + p} \varphi(w) + \Gamma p b \lambda' (pb - q_r/2) \, r^{-q_r + pb} \\
H(\omega)_{j0} &= \sqrt{\bar g}_{00} \sqrt{\bar g}_{ji} \, \bar\nabla^2 f^{i0} = \Gamma^{1/2} (p - q_w/2) \, r^{-q_r/2 - q_w/2 + p} \cdot \sqrt{g}_{ji} \, \nabla \varphi(w)^i \\
H(\omega)_{kl} &= \sqrt{\bar g}_{ki} \sqrt{\bar g}_{lj} \, \bar\nabla^2 f^{ij} = r^{p - q_w} \cdot \sqrt{g}_{ki} \sqrt{g}_{lj} \, \nabla^2 \varphi(w)^{ij} + \Gamma \frac{1}{2} q_w r^{-q_r} \cdot \big( p r^p \varphi(w) + p b \lambda' r^{pb} \big) \delta_{kl} .
\end{align*}
(Note that here the indices do not respect the covariant/contravariant convention, i.e., "$\sqrt{\bar g}_{IK}$" and "$H(\omega)_{KL}$" do not stand for covariant tensors: we really manipulate everything in coordinates explicitly.)
Now, note that the desired condition means that $H(\omega)_{KL}$ should remain bounded both for $r \to +\infty$ and $r \to 0$. That is, the exponents of $r$ in the non-zero terms must all be 0. Thus, since we assume that $\lambda' \ne 0$, and that $\nabla^2 \varphi$ is not constant equal to 0 and so in particular $\varphi$ and $\nabla \varphi$ are not constant,
• Uniform boundedness of the second term in $H(\omega)_{kl}$ implies that $b = 1$. Indeed $\lambda' \ne 0$, and the first term (in $\nabla^2 \varphi$) cannot cancel out both the term in $\varphi(w) r^{p - q_r}$ and the term in $\lambda' r^{pb - q_r}$ if they scale differently with $r$. This also implies that either $p = q_w = q_r$ or that $q_w = q_r$ and $\nabla^2 \varphi(w)^{ij} = \frac{1}{2} \Gamma q_w p \big( \varphi(w) + \lambda' \big) g^{ij}$ for all $w$.
• Uniform boundedness of $H(\omega)_{00}$ implies that $p = q_r$ or $p = q_r/2$.
• Uniform boundedness of $H(\omega)_{j0}$ implies that $p = \frac{q_r + q_w}{2}$ or $p = q_w/2$. We saw in the first point that $q_r = q_w$, so equivalently $p = q_r = q_w$ or $p = q_r/2 = q_w/2$.
Thus we get that $f$ can have Lipschitz-continuous Riemannian gradients only if $b = 1$ and $p = q_r = q_w$, or if $b = 1$ and $p = q_r/2 = q_w/2$ and $\nabla^2 \varphi(w) = \Gamma p^2 \big( \varphi(w) + \lambda' \big) g_w$ for all $w$.

In preparation for the proof of Prop. 3.3, let us first provide a formal proof of the variational representation of the squared-TV norm mentioned at the beginning of Sec. 3.2, with a characterization of the set of minimizers. See [Chi17, App. 1] for the rigorous justification of these arguments in the more general context of minimization of convex and positively 1-homogeneous integral functionals over the space of signed measures.
Lemma D.1 ("η-trick" for the squared TV-norm). We have
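\[
\|\nu\|_{TV}^2 = \Big( \int_{\mathcal{W}} |\nu(dw)| \Big)^2 = \inf_{\eta \in \mathcal{P}(\mathcal{W})} \int_{\mathcal{W}} \frac{|\nu(dw)|^2}{\eta(dw)} = \inf_{\substack{\eta \in \mathcal{P}(\mathcal{W}),\ f : \mathcal{W} \to \mathbb{R} \\ \text{s.t. } f\eta = \nu}} \int_{\mathcal{W}} |f|^2 \, d\eta .
\]
Moreover the infimum in the third expression is attained at (and only at) $\eta(dw) = \frac{|\nu(dw)|}{\|\nu\|_{TV}}$, and the infimum in the fourth expression is attained at (and only at) the same $\eta$ and $f = \frac{\nu(dw)}{|\nu(dw)|} \|\nu\|_{TV}$.
Proof. The infimum in the third expression is the value of a convex constrained minimization problem, whose Lagrangian is $L(\eta; \lambda) = \int \frac{|\nu|^2}{\eta} + \lambda \big( \int d\eta - 1 \big)$. The dual optimality condition implies $\lambda = \big( \frac{d\nu}{d\eta}(w) \big)^2$ for all $w \in \operatorname{supp}(\eta)$, so the infimum is attained at $\eta(dw) = \frac{|\nu(dw)|}{\|\nu\|_{TV}}$, with optimal value $\|\nu\|_{TV}^2$. The optimality condition for the infimum in the fourth expression follows directly from the one for the third expression and from the constraint $f\eta = \nu$.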
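On a finite set, the variational representation of the squared TV norm reduces to a Cauchy-Schwarz inequality, which makes it easy to test. The following sketch (with hypothetical names, and a discrete ν given as a list of signed masses) compares the objective for random probability vectors η with its value at the claimed minimizer.

```python
import random

def tv_squared(nu):
    return sum(abs(v) for v in nu) ** 2

def eta_objective(nu, eta):
    """Discrete analogue of the integral of |nu|^2 / eta (terms with nu_i = 0 contribute 0)."""
    return sum(v * v / e for v, e in zip(nu, eta) if v != 0.0)

def optimal_eta(nu):
    """Claimed minimizer: eta = |nu| / ||nu||_TV."""
    tv = sum(abs(v) for v in nu)
    return [abs(v) / tv for v in nu]

rng = random.Random(1)
nu = [0.5, -1.25, 0.0, 2.0]
values = []
for _ in range(1000):
    raw = [rng.uniform(0.01, 1.0) for _ in nu]
    total = sum(raw)
    eta = [x / total for x in raw]          # random point of the simplex
    values.append(eta_objective(nu, eta))
at_optimum = eta_objective(nu, optimal_eta(nu))
```

With these masses the TV norm is 3.75, so `at_optimum` equals 3.75² = 14.0625 up to rounding, and every entry of `values` is at least as large.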
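Proof of Prop. 3.3. By the lemma above,
\begin{align*}
\inf_{\eta \in \mathcal{P}(\mathcal{W})} J_\lambda(\eta) &= \inf_{\eta \in \mathcal{P}(\mathcal{W}),\ f : \mathcal{W} \to \mathbb{R}} G(f\eta) + \frac{\lambda}{2} \int_{\mathcal{W}} |f|^2 \, d\eta \\
&= \inf_{\nu \in \mathcal{M}(\mathcal{W})} \ \inf_{\substack{\eta \in \mathcal{P}(\mathcal{W}),\ f : \mathcal{W} \to \mathbb{R} \\ \text{s.t. } f\eta = \nu}} G(f\eta) + \frac{\lambda}{2} \int_{\mathcal{W}} |f|^2 \, d\eta \\
&= \inf_{\nu \in \mathcal{M}(\mathcal{W})} G(\nu) + \frac{\lambda}{2} \Bigg( \inf_{\substack{\eta \in \mathcal{P}(\mathcal{W}),\ f : \mathcal{W} \to \mathbb{R} \\ \text{s.t. } f\eta = \nu}} \int_{\mathcal{W}} |f|^2 \, d\eta \Bigg) \\
&= \inf_{\nu \in \mathcal{M}(\mathcal{W})} G(\nu) + \frac{\lambda}{2} \|\nu\|_{TV}^2 = \inf_{\nu \in \mathcal{M}(\mathcal{W})} G_\lambda(\nu).
\end{align*}
Hence the equality of the optimal values. The claimed characterization of $\arg\min J_\lambda$ in terms of $\arg\min G_\lambda$ follows from the characterization of the minimizers of the inner minimization $\inf_{\eta \in \mathcal{P}(\mathcal{W}),\ f : \mathcal{W} \to \mathbb{R} \text{ s.t. } f\eta = \nu} \frac{\lambda}{2} \int_{\mathcal{W}} |f|^2 \, d\eta$ in the third line, which is given by the lemma above.
Furthermore, $J_\lambda$ is convex as the partial minimization of $(\eta, \nu) \mapsto G(\nu) + \frac{\lambda}{2} \int \frac{|\nu|^2}{\eta}$, which is jointly convex.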

Section: D.2 Proof of the explicit form of the two-timescale SDE (3.4)
For ease of reference, we recall here the two-timescale SDE (3.4):
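\[
\forall i \le N,\quad
\begin{cases}
dr^i_t = -\Gamma\, \nabla_{r^i} F'_{\lambda,2}\Big[\frac{1}{N}\sum_{j=1}^N \delta_{(r^j_t, w^j_t)}\Big](r^i_t, w^i_t)\, dt \\[4pt]
dw^i_t = -\nabla_{w^i} F'_{\lambda,2}\Big[\frac{1}{N}\sum_{j=1}^N \delta_{(r^j_t, w^j_t)}\Big](r^i_t, w^i_t)\, dt + \sqrt{2\beta^{-1}}\, dB^i_t.
\end{cases}
\]
By (C.2) with $b = 2$ and $p = 1$,
\[
F'_{\lambda,2}[\mu](r,w) = r\, G'[h\mu](w) + \frac{\lambda}{2}|r|^2
\quad\text{so}\quad
\nabla_r F'_{\lambda,2}[\mu](r,w) = G'[h\mu](w) + \lambda r
\quad\text{and}\quad
\nabla_w F'_{\lambda,2}[\mu](r,w) = r\,\nabla G'[h\mu](w).
\]
Finally, by definition $h\Big(\frac{1}{N}\sum_{j=1}^N \delta_{(r^j, w^j)}\Big) = \frac{1}{N}\sum_{j=1}^N r^j \delta_{w^j}$. Hence the second part of (3.4).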
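The explicit drift terms above translate directly into a particle update. The following is a minimal sketch only, under assumptions that are not in the text: an Euler-Maruyama time discretization, $\mathcal{W}$ taken to be the circle parameterized by an angle, a least-squares $G$ with the hypothetical feature $\tanh(\langle w, x\rangle)$, and illustrative function names.

```python
import numpy as np

def features(theta, X):
    """phi[i, j] = tanh(<w_j, x_i>) with w_j = (cos theta_j, sin theta_j)."""
    W = np.stack([np.cos(theta), np.sin(theta)])            # shape (2, N)
    return np.tanh(X @ W)                                    # shape (n, N)

def dfeatures(theta, X):
    """Derivative of phi[i, j] with respect to the angle theta_j."""
    W = np.stack([np.cos(theta), np.sin(theta)])
    dW = np.stack([-np.sin(theta), np.cos(theta)])
    return (1.0 - np.tanh(X @ W) ** 2) * (X @ dW)

def energy(r, theta, X, y, lam):
    """G(h mu) + (lam/2) int |r|^2 dmu for the empirical measure mu of the particles."""
    resid = features(theta, X) @ r / r.size - y
    return 0.5 * np.mean(resid**2) + 0.5 * lam * np.mean(r**2)

def two_timescale_step(r, theta, X, y, lam, Gamma, beta, dt, rng):
    """One Euler-Maruyama step of the two-timescale particle system."""
    resid = features(theta, X) @ r / r.size - y     # error of nu = (1/N) sum_j r_j delta_{w_j}
    G1 = features(theta, X).T @ resid / y.size      # G'[nu](w_i)
    dG1 = dfeatures(theta, X).T @ resid / y.size    # gradient of G'[nu] at w_i
    noise = np.sqrt(2.0 * dt / beta) * rng.normal(size=r.size)
    r_new = r - dt * Gamma * (G1 + lam * r)         # drift -Gamma * (G' + lam * r)
    theta_new = (theta - dt * r * dG1 + noise) % (2.0 * np.pi)   # drift -r * grad G', plus noise
    return r_new, theta_new
```

Passing `beta=np.inf` switches the noise off, which gives a quick way to check that the drift decreases the lifted energy for a small step size.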
Section: D.3 Proof of Prop. 3.4 (J λ satisfies P0, P1 and P2)
Simplifying the expression of the bilevel objective. The following expressions will be useful throughout our analyses of the bilevel problem (3.3).
Proposition D.2. We have that
\[
J_\lambda(\eta) = G(f_\eta \eta) + \frac{\lambda}{2}\int |f_\eta|^2\, d\eta
\]
where $f_\eta$ is the unique solution of the fixed-point equation
\[
\forall w\in\mathcal{W},\quad f_\eta(w) = -\frac{1}{\lambda}\, G'[f_\eta\eta](w). \tag{D.1}
\]
Furthermore,
\[
J'_\lambda[\eta](w) = -\frac{\lambda}{2}\,|f_\eta|^2(w). \tag{D.2}
\]
Proof. Consider the optimization problem defining J λ (η), for a fixed η,
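\[
\min_{f\in L^2_\eta(\mathcal{W})} G(f\eta) + \frac{\lambda}{2}\int_{\mathcal{W}} |f|^2\, d\eta.
\]
This problem is convex since $G$ is, and strongly convex in $L^2_\eta(\mathcal{W})$ thanks to the term in $\lambda$. So there exists a unique solution, which we denote by $\tilde f_\eta \in L^2_\eta(\mathcal{W})$, and it is characterized by the first-order optimality condition
\[
G'[\tilde f_\eta \eta]\,\eta + \lambda \tilde f_\eta\, \eta = 0 \quad\text{in } \mathcal{M}(\mathcal{W}).
\]
Now let $f_\eta = -\frac{1}{\lambda} G'[\tilde f_\eta \eta]$, which is defined over all of $\mathcal{W}$. Then $f_\eta$ satisfies the fixed-point equation (D.1) by construction. Conversely, for any solution $g_\eta$ of (D.1), its restriction to $\operatorname{supp}(\eta)$, viewed as an element $\tilde g_\eta$ of $L^2_\eta(\mathcal{W})$, must in particular satisfy $G'[\tilde g_\eta \eta]\,\eta + \lambda \tilde g_\eta\, \eta = 0$ in $\mathcal{M}(\mathcal{W})$, and so $\tilde g_\eta = \tilde f_\eta$, and so
\[
g_\eta = -\frac{1}{\lambda} G'[g_\eta \eta] = -\frac{1}{\lambda} G'[\tilde f_\eta \eta] = f_\eta.
\]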
Furthermore, by differentiability of $G$, the map $\eta \mapsto \tilde f_\eta$ is continuous (in the total variation sense). So in turn, $\eta \mapsto f_\eta(w)$, the unique solution of (D.1), is continuous for each $w$ (in the total variation sense). So by the envelope theorem, since for any fixed $f$ the first variation of $\eta \mapsto G(f\eta) + \frac{\lambda}{2}\int |f|^2\, d\eta$ is $w \mapsto f(w)\, G'[f\eta](w) + \frac{\lambda}{2}|f(w)|^2$,
\[
J'_\lambda[\eta](w) = f_\eta(w)\, G'[f_\eta\eta](w) + \frac{\lambda}{2}|f_\eta(w)|^2 = -\frac{\lambda}{2}|f_\eta(w)|^2 = -\frac{1}{2\lambda}\,\big|G'[f_\eta\eta]\big|^2(w),
\]
which is precisely Eq. (D.2).
We remark that the above manipulations rely crucially on the fact that the optimization problem (1.1) is over signed measures and not just non-negative measures (as otherwise we would additionally need to constrain $f \ge 0$), and on the regularization term being $\|\nu\|_{TV}^2$ instead of $\|\nu\|_{TV}$.
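Formulas (D.1) and (D.2) can be checked numerically in a finite setting. The sketch below assumes a least-squares $G(\nu) = \frac{1}{2n}\|\Phi\nu - y\|^2$ with a hypothetical feature matrix `Phi` and a discrete $\eta$; the first variation is identified with the partial derivative with respect to the (unnormalized) weights of $\eta$. None of these names come from the paper.

```python
import numpy as np

def solve_inner(Phi, y, eta, lam):
    """Minimize over f: 0.5*mean((Phi @ (eta*f) - y)**2) + 0.5*lam*sum(eta*f**2).

    The first-order condition, written on every atom, is
    (Phi^T Phi / n) diag(eta) f + lam f = Phi^T y / n, i.e. the fixed point (D.1).
    """
    n, m = Phi.shape
    A = (Phi.T @ Phi / n) * eta[None, :] + lam * np.eye(m)
    f = np.linalg.solve(A, Phi.T @ y / n)
    value = 0.5 * np.mean((Phi @ (eta * f) - y) ** 2) + 0.5 * lam * np.sum(eta * f**2)
    return f, value

def first_variation_fd(Phi, y, eta, lam, h=1e-6):
    """Central finite differences of eta -> J_lambda(eta) along each weight."""
    grad = np.zeros_like(eta)
    for j in range(eta.size):
        e = np.zeros_like(eta)
        e[j] = h
        grad[j] = (solve_inner(Phi, y, eta + e, lam)[1]
                   - solve_inner(Phi, y, eta - e, lam)[1]) / (2.0 * h)
    return grad
```

Comparing `first_variation_fd` with `-0.5 * lam * f**2` then reproduces (D.2) up to discretization error.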
Preliminary estimates.
Lemma D.3. Under Assumption 1, for any $\nu \in \mathcal{M}(\mathcal{W})$, we have
\[
\sup_{w\in\mathcal{W}} |G'[\nu](w)|^2 \le 2 L_0\, G(\nu).
\]
Proof. We follow the proof technique of [GGGM21, Appendix D]. Let $w_0 \in \mathcal{W}$ and $\nu' = \nu - \frac{1}{L_0} G'[\nu](w_0)\,\delta_{w_0}$. By the mean-value theorem there exists $\theta\in(0,1)$ such that $G(\nu') - G(\nu) = \int G'[\nu + \theta(\nu'-\nu)]\, d(\nu'-\nu)$, and so
\begin{align*}
\inf G \le G(\nu') &\le G(\nu) + \int G'[\nu]\, d(\nu'-\nu) + \frac{L_0}{2}\|\nu'-\nu\|_{TV}^2 \\
&= G(\nu) - \frac{1}{L_0} G'[\nu](w_0)^2 + \frac{1}{2L_0} G'[\nu](w_0)^2 = G(\nu) - \frac{1}{2L_0} G'[\nu](w_0)^2.
\end{align*}
Hence, since $G$ is non-negative by Assumption 1,
\[
\forall w\in\mathcal{W},\quad \frac{1}{2L_0} G'[\nu](w)^2 \le G(\nu) - \inf G \le G(\nu).
\]
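Lemma D.4. Under Assumption 1, let $\eta\in\mathcal{P}(\mathcal{W})$ and let $f_\eta$ be as in (D.1). Then
\[
\sup_{\mathcal{W}} |f_\eta| \le \frac{1}{\lambda}\sqrt{2L_0 J_\lambda(\eta)}
\]
and for each $i\in\{1,2\}$,
\[
\sup_{w\in\mathcal{W}} \big\|\nabla^i f_\eta\big\|_w \le \frac{L_i}{\lambda^2}\sqrt{2L_0 J_\lambda(\eta)} + \frac{B_i}{\lambda}.
\]
Moreover, $J_\lambda(\eta) \le G(0)$ for all $\eta\in\mathcal{P}(\mathcal{W})$.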
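Proof. For the first inequality, by definition $G'[f_\eta\eta](w) = -\lambda f_\eta(w)$ for all $w\in\mathcal{W}$, so
\[
\lambda^2 |f_\eta(w)|^2 = |G'[f_\eta\eta](w)|^2 \le 2L_0\, G(f_\eta\eta) \le 2L_0\Big(G(f_\eta\eta) + \frac{\lambda}{2}\int |f_\eta|^2 d\eta\Big) = 2L_0 J_\lambda(\eta)
\]
where the first inequality follows from Lem. D.3.
For the second part, by Assumption 1, $\forall\nu\in\mathcal{M}(\mathcal{W}),\ \sup_w \|\nabla^i G'[\nu]\|_w \le L_i\|\nu\|_{TV} + B_i$, so
\[
\lambda \big\|\nabla^i f_\eta\big\|_w = \big\|\nabla^i G'[f_\eta\eta]\big\|_w \le B_i + L_i\|f_\eta\eta\|_{TV} = B_i + L_i\int |f_\eta|\, d\eta \le B_i + L_i \sup_{\mathcal{W}}|f_\eta| \le B_i + L_i\,\frac{1}{\lambda}\sqrt{2L_0 J_\lambda(\eta)}
\]
by the first part of the lemma.
Finally, the uniform bound on $J_\lambda(\eta)$ follows by taking $f=0$ in the infimum defining $J_\lambda$:
\[
J_\lambda(\eta) = \inf_{f\in L^2_\eta} G(f\eta) + \frac{\lambda}{2}\int |f|^2 d\eta \le G(0).
\]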
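Lemma D.5. Under Assumption 1, $J_\lambda : \mathcal{P}(\mathcal{W}) \to \mathbb{R}$ is weakly continuous and
\[
\forall \eta,\eta'\in\mathcal{P}(\mathcal{W}),\quad |J_\lambda(\eta) - J_\lambda(\eta')| \le B\, W_2(\eta,\eta')
\]
where
\[
B = \sqrt{2L_0 G(0)}\cdot\Big(\frac{L_1}{\lambda^2}\sqrt{2L_0 G(0)} + \frac{B_1}{\lambda}\Big).
\]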
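Proof. For any $\eta\in\mathcal{P}(\mathcal{W})$, letting $f_\eta$ be as in (D.1), we have $J'_\lambda[\eta](w) = -\frac{\lambda}{2}|f_\eta|^2(w)$, so $\nabla J'_\lambda[\eta](w) = -\lambda f_\eta(w)\nabla f_\eta(w)$ and
\[
\|\nabla J'_\lambda[\eta](w)\|_w \le \lambda \sup_{\mathcal{W}}|f_\eta| \cdot \sup_{\mathcal{W}}\|\nabla f_\eta\| \le \lambda\cdot\frac{1}{\lambda}\sqrt{2L_0 G(0)}\cdot\Big(\frac{L_1}{\lambda^2}\sqrt{2L_0 G(0)} + \frac{B_1}{\lambda}\Big) =: B < \infty
\]
by Lem. D.4, uniformly in $\eta\in\mathcal{P}(\mathcal{W})$ and $w\in\mathcal{W}$. So by Lem. D.8 below, we have $|J_\lambda(\eta) - J_\lambda(\eta')| \le B\, W_2(\eta,\eta')$ for all $\eta,\eta'\in\mathcal{P}(\mathcal{W})$. Moreover $W_2$ metrizes weak convergence, so $J_\lambda$ is weakly continuous.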
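Lemma D.6. Under Assumption 1, let $w'\in\mathcal{W}$ and $\eta\in\mathcal{P}(\mathcal{W})$. Let $h:\mathcal{W}\to\mathbb{R}$ and suppose that
\[
\forall w\in\mathcal{W},\quad \lambda h(w) + \int G''[f_\eta\eta](w,w'')\, d\eta(w'')\, h(w'') = -G''[f_\eta\eta](w,w')\, f_\eta(w').
\]
Then
\[
\sup_{w\in\mathcal{W}} |h(w)| \le \Big(1+\frac{L_0}{\lambda}\Big)\frac{L_0}{\lambda}\sqrt{2L_0 G(0)}.
\]
Alternatively, suppose that there exists $s'\in T_{w'}\mathcal{W}$ with $\|s'\|_{w'} = 1$ such that
\[
\forall w\in\mathcal{W},\quad \lambda h(w) + \int G''[f_\eta\eta](w,w'')\, d\eta(w'')\, h(w'') = -\big\langle s', \nabla_{w'}\big[G''[f_\eta\eta](w,w')\, f_\eta(w')\big]\big\rangle_{w'}.
\]
Then
\[
\sup_{w\in\mathcal{W}} |h(w)| \le \Big(1+\frac{L_0}{\lambda}\Big)\cdot\Big(\Big(1+\frac{L_0}{\lambda}\Big)\frac{L_1}{\lambda}\sqrt{2L_0 G(0)} + \frac{L_0 B_1}{\lambda}\Big).
\]
Proof. Let $\mathcal{G} : L^2_\eta(\mathcal{W}) \to L^2_\eta(\mathcal{W})$ be the operator $(\mathcal{G}h)(w) = \int G''[f_\eta\eta](w,w'')\, d\eta(w'')\, h(w'')$.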
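$\mathcal{G}$ is well-defined as a bounded operator, since Assumption 1 implies that $|G''[f_\eta\eta](w,w')| \le L_0$. Note that $G''[f_\eta\eta](w,w'')$ is symmetric in $w$ and $w''$, and that by convexity of $G$, $G''[f_\eta\eta](w,w'') \ge 0$ for all $w,w''$. Consequently, $\mathcal{G}$ is a symmetric positive-semi-definite operator from $L^2_\eta(\mathcal{W})$ to itself. On the other hand, let $V_1(\cdot) = -G''[f_\eta\eta](\cdot,w')\, f_\eta(w')$. By Lem. D.4 we have
\[
\|V_1\|_{L^2_\eta} \le \sup_{\mathcal{W}}|V_1| \le \sup_{\mathcal{W}\times\mathcal{W}}|G''[f_\eta\eta]|\cdot\sup_{\mathcal{W}}|f_\eta| \le L_0\cdot\frac{1}{\lambda}\sqrt{2L_0G(0)} =: \overline{V}_1.
\]
Also let $V_2(\cdot) = -\big\langle s', \nabla_{w'}\big[G''[f_\eta\eta](\cdot,w')\, f_\eta(w')\big]\big\rangle_{w'}$. Then by Lem. D.4,
\begin{align*}
\|V_2\|_{L^2_\eta} \le \sup_{\mathcal{W}}|V_2|
&\le \sup_{w,w'}\|\nabla_{w'}G''[f_\eta\eta](w,w')\|\cdot\sup_{\mathcal{W}}|f_\eta| + \sup_{\mathcal{W}\times\mathcal{W}}|G''[f_\eta\eta]|\cdot\sup_{\mathcal{W}}\|\nabla f_\eta\| \\
&\le L_1\cdot\frac{1}{\lambda}\sqrt{2L_0G(0)} + L_0\cdot\Big(\frac{L_1}{\lambda^2}\sqrt{2L_0G(0)} + \frac{B_1}{\lambda}\Big)
= \Big(1+\frac{L_0}{\lambda}\Big)\frac{L_1}{\lambda}\sqrt{2L_0G(0)} + \frac{L_0B_1}{\lambda} =: \overline{V}_2.
\end{align*}
Denote by $\tilde h$ the restriction of $h$ to $\operatorname{supp}(\eta)$ viewed as an element of $L^2_\eta(\mathcal{W})$. Then, denoting by $\mathrm{id}$ the identity operator on $L^2_\eta(\mathcal{W})$, we may rewrite the assumption as $(\lambda\,\mathrm{id} + \mathcal{G})\tilde h = V_j$ for $j = 1$ or $2$, and so
\[
\sqrt{\int |h|^2 d\eta} = \big\|\tilde h\big\|_{L^2_\eta} = \big\|(\lambda\,\mathrm{id} + \mathcal{G})^{-1}V_j\big\|_{L^2_\eta} \le \lambda^{-1}\|V_j\|_{L^2_\eta} \le \lambda^{-1}\overline{V}_j
\]
since $\mathcal{G}$ is positive-semi-definite and $\lambda > 0$. Thus for any $w\in\mathcal{W}$, we get the point-wise bound
\begin{align*}
\lambda h(w) &= V_j(w) - \int d\eta(w'')\, G''[f_\eta\eta](w,w'')\, h(w'') \\
\lambda|h(w)| &\le |V_j(w)| + \int d\eta(w'')\, |G''[f_\eta\eta](w,w'')|\, |h(w'')|
\le \overline{V}_j + \|G''[f_\eta\eta](w,\cdot)\|_{L^2_\eta}\, \|\tilde h\|_{L^2_\eta}
\le \overline{V}_j + L_0\cdot\lambda^{-1}\overline{V}_j.
\end{align*}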
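Lemma D.7. Under Assumption 1, let $\eta,\eta'\in\mathcal{P}(\mathcal{W})$ and let $f_\eta, f_{\eta'}$ be as in (D.1). Then there exist constants $H, H'$ dependent only on $\lambda^{-1}$, $G(0)$ and $L_0, L_1, B_1, L_2$ such that
\[
\sup_{\mathcal{W}}|f_\eta - f_{\eta'}| \le H\, W_2(\eta,\eta') \quad\text{and}\quad \sup_{w\in\mathcal{W}}\|\nabla f_\eta - \nabla f_{\eta'}\|_w \le H'\, W_2(\eta,\eta').
\]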
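Proof. For each $w\in\mathcal{W}$, we denote the first variation of $\eta\mapsto f_\eta(w)$ by $w'\mapsto\frac{\delta f_\eta(w)}{\delta\eta(dw')}$. Let us show that this quantity is uniformly bounded. (The rigorous proof that the first variation $\frac{\delta f_\eta(w)}{\delta\eta(dw')}$ is well-defined for all $w,w'\in\mathcal{W}$ and $\eta\in\mathcal{P}(\mathcal{W})$ would follow from the same derivations as for the uniform bound, so we omit it here.) By definition, for any $w\in\mathcal{W}$ and $\eta\in\mathcal{P}(\mathcal{W})$ and $w'\in\mathcal{W}$, $\lambda f_\eta(w) + G'[f_\eta\eta](w) = 0$, so
\begin{align*}
\lambda\frac{\delta f_\eta(w)}{\delta\eta(w')} + G''[f_\eta\eta](w,w')\, f_\eta(w') + \int G''[f_\eta\eta](w,\cdot)\,\frac{\delta f_\eta(\cdot)}{\delta\eta(w')}\, d\eta &= 0 \\
\lambda\frac{\delta f_\eta(w)}{\delta\eta(w')} + \int G''[f_\eta\eta](w,w'')\,\eta(dw'')\,\frac{\delta f_\eta(w'')}{\delta\eta(w')} &= -G''[f_\eta\eta](w,w')\, f_\eta(w'). \tag{D.3}
\end{align*}
So by Lem. D.6 applied to $h = \frac{\delta f_\eta(\cdot)}{\delta\eta(w')}$, we indeed have that $\frac{\delta f_\eta(w)}{\delta\eta(w')}$ is bounded by a constant uniformly in $w$, $w'$ and $\eta$.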

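Let us now show that
\[
\sup_{w\in\mathcal{W}}\ \sup_{\eta\in\mathcal{P}(\mathcal{W})}\ \sup_{w'\in\mathcal{W}} \Big\|\nabla_{w'}\frac{\delta f_\eta(w)}{\delta\eta(dw')}\Big\|_{w'} \le H
\]
for a constant $H$ depending only on $\lambda^{-1}, L_0, L_1, B_1, G(0)$. Indeed, it suffices to show that for any $s'\in T_{w'}\mathcal{W}$ such that $\|s'\|_{w'}=1$, $\big\langle s', \nabla_{w'}\frac{\delta f_\eta(w)}{\delta\eta(dw')}\big\rangle_{w'} \le H$. Now, starting from (D.3), which holds for all $w, w', \eta$, and differentiating with respect to $w'$ in the direction $s'$, we get that
\[
\lambda\Big\langle s', \nabla_{w'}\frac{\delta f_\eta(w)}{\delta\eta(w')}\Big\rangle_{w'} + \int G''[f_\eta\eta](w,w'')\,\eta(dw'')\Big\langle s', \nabla_{w'}\frac{\delta f_\eta(w'')}{\delta\eta(w')}\Big\rangle_{w'} = -\big\langle s', \nabla_{w'}\big[G''[f_\eta\eta](w,w')\, f_\eta(w')\big]\big\rangle_{w'}
\]
and so $h(w) = \big\langle s', \nabla_{w'}\frac{\delta f_\eta(w)}{\delta\eta(dw')}\big\rangle_{w'}$ satisfies the conditions of Lem. D.6, which proves the claim.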
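Next let us show that
\[
\sup_{w\in\mathcal{W}}\ \sup_{\substack{s\in T_w\mathcal{W} \\ \|s\|_w=1}}\ \sup_{\eta\in\mathcal{P}(\mathcal{W})}\ \sup_{w'\in\mathcal{W}} \Big\|\nabla_{w'}\frac{\delta\langle s,\nabla f_\eta(w)\rangle_w}{\delta\eta(dw')}\Big\|_{w'} \le H'
\]
for a constant $H'$ depending only on $\lambda^{-1}, L_0, L_1, B_1, G(0)$ and $L_2$. Indeed, starting from (D.3) and differentiating with respect to $w'$ in the direction $s'$, and differentiating with respect to $w$ in the direction $s$, we get
\[
\lambda\Big\langle s', \nabla_{w'}\frac{\delta\langle s,\nabla f_\eta(w)\rangle_w}{\delta\eta(w')}\Big\rangle_{w'} + \int \big\langle s, \nabla_w G''[f_\eta\eta](w,w'')\big\rangle_w\,\eta(dw'')\Big\langle s', \nabla_{w'}\frac{\delta f_\eta(w'')}{\delta\eta(w')}\Big\rangle_{w'} = -\Big\langle s, \nabla_w\big\langle s', \nabla_{w'}\big[G''[f_\eta\eta](w,w')\, f_\eta(w')\big]\big\rangle_{w'}\Big\rangle_w
\]
and so
\begin{align*}
\lambda\Big\|\nabla_{w'}\frac{\delta\langle s,\nabla f_\eta(w)\rangle_w}{\delta\eta(dw')}\Big\|_{w'}
&\le \|\nabla_w\nabla_{w'}G''[f_\eta\eta]\|\cdot|f_\eta(w')| + \|\nabla_wG''[f_\eta\eta]\|_w\cdot\|\nabla f_\eta(w')\|_{w'} \\
&\quad + \sup_{w''\in\mathcal{W}}\|\nabla_wG''[f_\eta\eta](w,w'')\|_w\cdot\sup_{w''\in\mathcal{W}}\Big\|\nabla_{w'}\frac{\delta f_\eta(w'')}{\delta\eta(dw')}\Big\|_{w'} \\
&\le L_2\cdot\frac{1}{\lambda}\sqrt{2L_0G(0)} + L_1\cdot\Big(\frac{L_1}{\lambda^2}\sqrt{2L_0G(0)} + \frac{B_1}{\lambda}\Big) + L_1\cdot H =: H'
\end{align*}
by Assumption 1.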
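Now fix $w\in\mathcal{W}$. By Lem. D.8 below applied to $F(\eta) = f_\eta(w)$, we have that
\[
|f_\eta(w) - f_{\eta'}(w)| \le \sup_{\eta''\in\mathcal{P}(\mathcal{W})}\ \sup_{w'\in\mathcal{W}}\Big\|\nabla_{w'}\frac{\delta f_{\eta''}(w)}{\delta\eta''(dw')}\Big\|_{w'} W_2(\eta,\eta') \le H\, W_2(\eta,\eta').
\]
Likewise, fix any $w\in\mathcal{W}$ and let $s = \frac{\nabla f_{\eta'}(w) - \nabla f_\eta(w)}{\|\nabla f_{\eta'}(w) - \nabla f_\eta(w)\|_w} \in T_w\mathcal{W}$. Then by Lem. D.8 below applied to $F(\eta) = \langle s, \nabla f_\eta(w)\rangle_w$,
\[
\|\nabla f_{\eta'}(w) - \nabla f_\eta(w)\| = \langle s, \nabla f_{\eta'}(w)\rangle_w - \langle s, \nabla f_\eta(w)\rangle_w \le H'\, W_2(\eta,\eta').
\]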
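Lemma D.8. Let $\mathcal{W}$ be a compact Riemannian manifold and $F:\mathcal{P}(\mathcal{W})\to\mathbb{R}$ such that
\[
\forall\eta\in\mathcal{P}(\mathcal{W}),\ \forall w\in\mathcal{W},\quad \|\nabla F'[\eta](w)\|_w \le B.
\]
Then $\forall\eta,\eta'\in\mathcal{P}(\mathcal{W}),\ |F(\eta) - F(\eta')| \le B\, W_1(\eta,\eta') \le B\, W_2(\eta,\eta')$.
Proof. For any $x,y\in\mathcal{W}$, let $(\Sigma_\theta(x,y))_{\theta\in[0,1]}$ be the constant-speed length-minimizing geodesic in $\mathcal{W}$ interpolating between $x$ and $y$. Also let $\Sigma'_\theta(x,y) = \frac{d}{d\theta}\Sigma_\theta(x,y) \in T_{\Sigma_\theta(x,y)}\mathcal{W}$ for any $\theta$. For example if $\mathcal{W}=\mathbb{R}^d$, $\Sigma_\theta(x,y) = x + \theta(y-x)$ and $\Sigma'_\theta(x,y) = y-x$ for all $\theta$. Let $\gamma$ be the optimal coupling between $\eta,\eta'$ in the $W_1$ sense, and for all $\theta\in[0,1]$, let $\eta_\theta = (\Sigma_\theta)_\sharp\gamma$ be the pushforward measure of $\gamma$ by $\Sigma_\theta$. Note that for any $\theta\in[0,1]$,
\[
\frac{d}{d\theta}F(\eta_\theta) = \int_{\mathcal{W}} F'[\eta_\theta]\, d(\partial_\theta\eta_\theta)
\]
and that
\begin{align*}
\forall\varphi:\mathcal{W}\to\mathbb{R},\quad \frac{d}{d\theta}\int_{\mathcal{W}}\varphi\, d\eta_\theta
&= \frac{d}{d\theta}\int_{\mathcal{W}\times\mathcal{W}}\varphi(\Sigma_\theta(x,y))\, d\gamma(x,y) \\
&= \int_{\mathcal{W}\times\mathcal{W}}\frac{d}{d\theta}\varphi(\Sigma_\theta(x,y))\, d\gamma(x,y) \\
&= \int_{\mathcal{W}\times\mathcal{W}}\big\langle\Sigma'_\theta(x,y), \nabla\varphi(\Sigma_\theta(x,y))\big\rangle_{\Sigma_\theta(x,y)}\, d\gamma(x,y).
\end{align*}
(The interchange of $\frac{d}{d\theta}$ and $\int_{\mathcal{W}\times\mathcal{W}}$ on the second line can be justified by the dominated convergence theorem assuming that $\varphi$ has bounded $C^1$ norm, which is the case of $F'[\eta_\theta]$ by assumption.) So by the Cauchy-Schwarz inequality,
\begin{align*}
\frac{d}{d\theta}F(\eta_\theta) &= \int_{\mathcal{W}\times\mathcal{W}}\big\langle\Sigma'_\theta(x,y), \nabla F'[\eta_\theta](\Sigma_\theta(x,y))\big\rangle_{\Sigma_\theta(x,y)}\, d\gamma(x,y) \\
\Big|\frac{d}{d\theta}F(\eta_\theta)\Big| &\le \int_{\mathcal{W}\times\mathcal{W}}\|\Sigma'_\theta(x,y)\|_{\Sigma_\theta(x,y)}\cdot\|\nabla F'[\eta_\theta](\Sigma_\theta(x,y))\|_{\Sigma_\theta(x,y)}\, d\gamma(x,y) \\
&\le \sup_{w\in\mathcal{W}}\ \sup_{\eta''\in\mathcal{P}(\mathcal{W})}\|\nabla F'[\eta''](w)\|_w\cdot\int_{\mathcal{W}\times\mathcal{W}}\|\Sigma'_\theta(x,y)\|_{\Sigma_\theta(x,y)}\, d\gamma(x,y) \\
&\le B\cdot\int_{\mathcal{W}\times\mathcal{W}}\operatorname{dist}(x,y)\, d\gamma(x,y) = B\, W_1(\eta,\eta')
\end{align*}
by definition of the geodesic $(\Sigma_\theta(x,y))_{\theta\in[0,1]}$ and by definition of the optimal coupling $\gamma$. Finally,
\[
|F(\eta) - F(\eta')| = \Big|\int_0^1\frac{d}{d\theta}F(\eta_\theta)\, d\theta\Big| \le \sup_{\theta\in[0,1]}\Big|\frac{d}{d\theta}F(\eta_\theta)\Big| \le B\, W_1(\eta,\eta').
\]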
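The Lipschitz bound of Lemma D.8 can be illustrated numerically. This sketch takes the Euclidean example $\mathcal{W}=\mathbb{R}$ mentioned in the proof (an assumption for simplicity, since the lemma itself is stated for a compact manifold) and a hypothetical functional $F(\eta) = \frac12\big(\int\sin\, d\eta\big)^2$, whose first variation $F'[\eta](w) = \big(\int\sin\, d\eta\big)\sin(w)$ has gradient bounded by $B=1$.

```python
import numpy as np

def F(samples):
    """F(eta) = 0.5 * (int sin d eta)^2 for the empirical measure of `samples`."""
    return 0.5 * np.mean(np.sin(samples)) ** 2

def w1_empirical(x, y):
    """W1 distance between two empirical measures on R with equally many atoms.

    In one dimension the optimal coupling matches sorted samples.
    """
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

rng = np.random.default_rng(3)
gaps = []
for _ in range(200):
    x = rng.normal(size=40)
    y = rng.normal(loc=rng.uniform(-1, 1), size=40)
    gaps.append(w1_empirical(x, y) - abs(F(x) - F(y)))   # should be non-negative with B = 1
```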
Proof of the Proposition.
Proof of Prop. 3.4. We first check (P0). The fact that J λ is convex is given by Prop. 3.3. Moreover, let any β > 0 and let us check that J λ,β := J λ + β -1 H (•|τ ) has a minimizer. Indeed, J λ,β is weakly continuous as shown in Lem. D.5, and non-negative so lower-bounded. Since W is compact then any set of probability measures on W is tight, i.e., any sequence in P(W) has a weakly convergent subsequence. So we conclude by the direct method of calculus of variations: let a sequence (η n ) n such that J λ,β (η n ) → inf P(W) J λ,β and extract a weakly convergent subsequence with limit η ∞ ; then by weak continuity η ∞ is a minimizer of J λ,β .
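We now show that $J_\lambda$ satisfies (P1). Recall from (D.2) that $J'_\lambda[\eta](w) = -\frac{\lambda}{2}|f_\eta|^2(w)$ with $f_\eta = -\frac{1}{\lambda}G'[f_\eta\eta]$ over $\mathcal{W}$.
Let us show the first condition for (P1):
\[
\forall\eta\in\mathcal{P}_2(\mathcal{W}),\ \forall w\in\mathcal{W},\quad \max_{\substack{s\in T_w\mathcal{W} \\ \|s\|_w\le 1}} \nabla^2 J'_\lambda[\eta](s,s) \le \Lambda
\]
for some $\Lambda<\infty$, where $\nabla^2$ denotes the Riemannian Hessian. We have
\[
\nabla J'_\lambda[\eta](w) = -\lambda f_\eta(w)\nabla f_\eta(w), \qquad
\nabla^2 J'_\lambda[\eta](w) = -\lambda f_\eta(w)\nabla^2 f_\eta(w) - \lambda\nabla f_\eta(w)\nabla^\top f_\eta(w)
\]
and so, for all $s\in T_w\mathcal{W}$ such that $\|s\|_w\le 1$,
\[
\nabla^2 J'_\lambda[\eta](s,s) \le \lambda|f_\eta|\,\|\nabla^2 f_\eta\| + \lambda\|\nabla f_\eta\|^2 \le \sqrt{2L_0G(0)}\Big(\frac{L_2}{\lambda^2}\sqrt{2L_0G(0)} + \frac{B_2}{\lambda}\Big) + \lambda\Big(\frac{L_1}{\lambda^2}\sqrt{2L_0G(0)} + \frac{B_1}{\lambda}\Big)^2
\]
by Lem. D.4.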
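Let us now check the second condition for (P1), namely that
\[
\forall w\in\mathcal{W},\ \forall\eta,\eta'\in\mathcal{P}_2(\mathcal{W}),\quad \|\nabla J'_\lambda[\eta] - \nabla J'_\lambda[\eta']\|_w \le \Lambda\, W_2(\eta,\eta')
\]
for some $\Lambda<\infty$. Indeed,
\begin{align*}
\|\nabla J'_\lambda[\eta] - \nabla J'_\lambda[\eta']\|_w
&= \lambda\|f_\eta\nabla f_\eta - f_{\eta'}\nabla f_{\eta'}\| \\
&\le \lambda\big(\|f_\eta(\nabla f_\eta - \nabla f_{\eta'})\| + \|(f_\eta - f_{\eta'})\nabla f_{\eta'}\|\big) \\
&\le \lambda\Big(\sup_{\eta''}\sup_{\mathcal{W}}|f_{\eta''}|\cdot\sup_{\mathcal{W}}\|\nabla f_\eta - \nabla f_{\eta'}\| + \sup_{\eta''}\sup_{\mathcal{W}}\|\nabla f_{\eta''}\|\cdot\sup_{\mathcal{W}}|f_\eta - f_{\eta'}|\Big) \\
&\le \lambda\Big(\frac{1}{\lambda}\sqrt{2L_0G(0)}\cdot H'\, W_2(\eta,\eta') + \Big(\frac{L_1}{\lambda^2}\sqrt{2L_0G(0)} + \frac{B_1}{\lambda}\Big)\cdot H\, W_2(\eta,\eta')\Big) =: \Lambda\, W_2(\eta,\eta')
\end{align*}
by Lem. D.4 and Lem. D.7.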
We now turn to the proof of (P2) with the quantitative bound on the local LSI constant. Let $\eta\in\mathcal{P}(\mathcal{W})$. By the first part of Lem. D.4, we directly have that
\[
|J'_\lambda[\eta](w)| = \frac{\lambda}{2}|f_\eta|^2(w) \le \frac{L_0}{\lambda}J_\lambda(\eta).
\]
In particular, by the Holley-Stroock bounded perturbation argument [HS86], the proximal Gibbs measure $\hat\eta := e^{-\beta J'_\lambda[\eta]}\tau/Z$ satisfies LSI with constant $\alpha_\eta = \alpha_\tau\exp\big(-\frac{1}{\lambda}L_0\beta J_\lambda(\eta)\big)$. Finally, we turn to the proof of the bound on the uniform LSI constant along the MFLD trajectory $(\eta_t)_{t\ge 0}$. Given the bound on the local LSI constants, it suffices to show that
\[
\forall\eta\in\mathcal{P}(\mathcal{W}),\ J_\lambda(\eta)\le G(0) \quad\text{and}\quad \forall t\ge 0,\ J_\lambda(\eta_t) \le J_\lambda(\eta_0) + \beta^{-1}H(\eta_0|\tau).
\]
The first bound was shown in Lem. D.4. For the second bound, note that $J_\lambda(\eta_t) + \beta^{-1}H(\eta_t|\tau)$ decreases with $t$, since MFLD is precisely the Wasserstein gradient flow for $\eta\mapsto J_\lambda(\eta) + \beta^{-1}H(\eta)$, and $H(\eta)$ and $H(\eta|\tau)$ differ by a constant. So, since relative entropy is non-negative,
\[
J_\lambda(\eta_t) \le J_\lambda(\eta_t) + \beta^{-1}H(\eta_t|\tau) \le J_\lambda(\eta_0) + \beta^{-1}H(\eta_0|\tau)
\]
for all t ≥ 0, as desired.

Section: E Details for Sec. 4 (global convergence by annealing)
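The following preliminary lemma allows us to control the effect of entropic regularization, using a box-kernel smoothing technique similar to [Chi22a].
Lemma E.1. Let $\mathcal{W}$ be a $d$-dimensional compact Riemannian manifold and denote by $\tau$ the uniform probability measure over $\mathcal{W}$. Let $J:\mathcal{P}(\mathcal{W})\to\mathbb{R}$ and $\eta^*\in\mathcal{P}(\mathcal{W})$, and suppose that there exist constants $A, B > 0$ such that
\[
\forall\eta \text{ s.t. } W_1(\eta,\eta^*)\le A,\quad J(\eta) - J(\eta^*) \le B\, W_\infty(\eta,\eta^*).
\]
Denote $J_\beta = J + \beta^{-1}H(\cdot|\tau)$, for any $\beta>0$. Then
\[
\min_{\eta:\, W_1(\eta,\eta^*)\le A} J_\beta(\eta) \le J(\eta^*) + \inf_{0<\epsilon\le\min\{1,A\}}\Big(B\epsilon + \frac{d}{\beta}\log\frac{1}{\epsilon}\Big) + \frac{\log C}{\beta}
\]
where $C := \Big(\inf_{w\in\mathcal{W}}\ \inf_{0<\epsilon\le 1}\ \epsilon^{-d}\cdot\tau(\{w';\ \operatorname{dist}_{\mathcal{W}}(w,w')\le\epsilon\})\Big)^{-1}$.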
Proof. The proof is adapted from [Chi22a]. It is based on constructing an ϵ-smoothed version of η * , i.e. a measure η ϵ which admits a density w.r.t. τ while being close to η * in an appropriate sense.
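Let any $0<\epsilon\le\min\{1,A\}$. Given $w\in\mathcal{W}$, define the probability measure $\gamma_{\epsilon,w}(dw')$ as the uniform probability measure over the geodesic ball $B_\epsilon(w) := \{w'\in\mathcal{W};\ \operatorname{dist}(w,w')\le\epsilon\}$. In other words, $\frac{d\gamma_{\epsilon,w}}{d\tau}(w') := \frac{\mathbb{1}(w'\in B_\epsilon(w))}{\tau(B_\epsilon(w))}$. Then, let $\gamma_\epsilon(dw,dw') = \eta^*(dw)\gamma_{\epsilon,w}(dw') \in \mathcal{P}(\mathcal{W}\times\mathcal{W})$, and let $\eta_\epsilon(dw') = \int_{w\in\mathcal{W}}\gamma_\epsilon(dw,dw')$ be its second marginal.
One can then verify that
\[
\frac{d\eta_\epsilon}{d\tau}(w') = \int_{w\in\mathcal{W}}\frac{d\gamma_{\epsilon,w}}{d\tau}(w')\,\eta^*(dw) = \int_{w\in\mathcal{W}}\frac{\mathbb{1}(w'\in B_\epsilon(w))}{\tau(B_\epsilon(w))}\,\eta^*(dw).
\]
Moreover there exists a positive constant $C$ such that $\tau(B_\epsilon(w))\ge C^{-1}\epsilon^d$ for all $\epsilon\le 1$ [GV79, Theorem 3.3]. As a consequence,
\[
H(\eta_\epsilon|\tau) = \int d\eta_\epsilon(w')\log\frac{d\eta_\epsilon}{d\tau}(w') \le \sup_{w\in\mathcal{W}} -\log\tau(B_\epsilon(w)) \le d\log(1/\epsilon) + \log C.
\]
Furthermore, by definition of the coupling $\gamma_\epsilon$, we have
\[
W_1(\eta_\epsilon,\eta^*) \le W_\infty(\eta_\epsilon,\eta^*) \le \epsilon \le A.
\]
Therefore, by assumption $J(\eta_\epsilon) - J(\eta^*) \le B\, W_\infty(\eta_\epsilon,\eta^*) \le B\epsilon$, and so
\[
\min_{\eta:\, W_1(\eta,\eta^*)\le A} J_\beta(\eta) \le J_\beta(\eta_\epsilon) = J(\eta_\epsilon) + \beta^{-1}H(\eta_\epsilon|\tau) \le J(\eta^*) + B\epsilon + \beta^{-1}\big(d\log(1/\epsilon) + \log C\big),
\]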
and the inequality of the lemma follows by taking the infimum over ϵ. 
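The infimum over $\epsilon$ in the lemma trades off the approximation error $B\epsilon$ against the entropy cost $\frac{d}{\beta}\log\frac1\epsilon$. A short sketch with hypothetical constants (and assuming $A=\infty$, so that the only constraint is $\epsilon\le 1$) compares a grid search with the stationary point $\epsilon = d/(\beta B)$ obtained by setting the derivative to zero.

```python
import numpy as np

def smoothing_bound(eps, B, d, beta, C):
    """Right-hand side B*eps + (d/beta)*log(1/eps) + log(C)/beta of the lemma."""
    return B * eps + (d / beta) * np.log(1.0 / eps) + np.log(C) / beta

B, d, beta, C = 2.0, 3, 50.0, 4.0          # hypothetical constants
eps_star = min(1.0, d / (beta * B))        # stationary point, clipped to the constraint eps <= 1
grid = np.linspace(1e-4, 1.0, 100001)
eps_grid = grid[np.argmin(smoothing_bound(grid, B, d, beta, C))]
closed_form = (d + np.log(C)) / beta - (d / beta) * np.log(d / (beta * B))
```

When $d/(\beta B)\le 1$, the value at the stationary point is $\frac{d+\log C}{\beta} - \frac{d}{\beta}\log\frac{d}{\beta B}$, which `closed_form` reproduces.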
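\[
T_\Delta \le \frac{2d}{\alpha_\tau\Delta J^*_\lambda}\log\Big(\frac{4C^{1/d}B}{\Delta J^*_\lambda}\Big)\cdot\exp\Big(\frac{4dL_0G(0)}{\lambda\Delta J^*_\lambda}\log\Big(\frac{4C^{1/d}B}{\Delta J^*_\lambda}\Big)\Big)\cdot\log\Big(\frac{2J_\lambda(\eta_0)}{\Delta J^*_\lambda} + \frac{H(\eta_0|\tau)}{2\log C}\Big)
\]
where $C = \max\Big\{1, \Big(\inf_{w\in\mathcal{W}}\ \inf_{0<\epsilon\le 1}\ \epsilon^{-d}\cdot\tau(\{w';\ \operatorname{dist}_{\mathcal{W}}(w,w')\le\epsilon\})\Big)^{-1}\Big\}$.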
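Proof of Prop. E.2. Let $(\eta_t)_t$ be the MFLD-Bilevel trajectory with constant inverse temperature parameter $\beta$ to be chosen. Denote $J_{\lambda,\beta} = J_\lambda + \beta^{-1}H(\cdot|\tau)$. Recall that by Prop. 3.4, $J_{\lambda,\beta}$ satisfies $\alpha_\beta$-LSI uniformly along the MFLD trajectory with $\alpha_\beta = \alpha_\tau\exp\big(-\frac{1}{\lambda}L_0\beta G(0)\big)$. So by Thm. 2.1, for all $t$,
\[
J_\lambda(\eta_t) \le J_{\lambda,\beta}(\eta_t) \le \inf J_{\lambda,\beta} + e^{-2\beta^{-1}\alpha_\beta t}\big(J_{\lambda,\beta}(\eta_0) - \inf J_{\lambda,\beta}\big) \le \inf J_{\lambda,\beta} + e^{-2\beta^{-1}\alpha_\beta t}J_{\lambda,\beta}(\eta_0),
\]
where in the first inequality we used that $J_{\lambda,\beta} - J_\lambda = \beta^{-1}H(\cdot|\tau) \ge 0$.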
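Furthermore, by applying Lem. E.1 to $J = J_\lambda$, $\eta^* = \arg\min J_\lambda$, $A = \infty$ and $B = \sqrt{2L_0G(0)}\cdot\big(\frac{L_1}{\lambda^2}\sqrt{2L_0G(0)} + \frac{B_1}{\lambda}\big)$ the constant from Lem. D.5, we find that
\[
\inf J_{\lambda,\beta} \le \inf J_\lambda + \inf_{0<\epsilon\le 1}\Big(B\epsilon + \frac{d}{\beta}\log\frac{1}{\epsilon}\Big) + \frac{\log C}{\beta}.
\]
Taking $\beta = \frac{d}{B}s$ for some $s\ge 1$ to be chosen, and evaluating the infimum at $\epsilon = \frac{d}{\beta B}$ (which satisfies $\epsilon = 1/s\le 1$), we get
\[
\inf J_{\lambda,\beta} \le J^*_\lambda + \frac{d+\log C'}{\beta} - \frac{d}{\beta}\log\frac{d}{\beta B},
\]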
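where $C' = \max\{1, C\}$. So in order to guarantee that $J_\lambda(\eta_t)\le(1+\Delta)J^*_\lambda$, it suffices to take $t$ such that
\[
J^*_\lambda + \frac{d+\log C'}{\beta} - \frac{d}{\beta}\log\frac{d}{\beta B} + e^{-2\beta^{-1}\alpha_\beta t}\big(J_\lambda(\eta_0) + \beta^{-1}H(\eta_0|\tau)\big) \le (1+\Delta)J^*_\lambda,
\]
i.e.
\[
t \ge \frac{\beta}{2\alpha_\beta}\log\Bigg(\frac{J_\lambda(\eta_0) + \beta^{-1}H(\eta_0|\tau)}{\Delta J^*_\lambda - \frac{d+\log C'}{\beta} + \frac{d}{\beta}\log\frac{d}{\beta B}}\Bigg) =: T_s,
\]
assuming that $\Delta$ is large enough so that the above expression is well-defined. More explicitly, substituting the value of $\alpha_\beta$ and of $\beta = \frac{d}{B}s$, we have
\begin{align*}
T_s &= \frac{\beta}{2\alpha_\tau}\cdot\exp\Big(\frac{1}{\lambda}L_0\beta G(0)\Big)\cdot\log\Bigg(\frac{J_\lambda(\eta_0) + \beta^{-1}H(\eta_0|\tau)}{\Delta J^*_\lambda - \frac{d+\log C'}{\beta} + \frac{d}{\beta}\log\frac{d}{\beta B}}\Bigg) \\
&= \frac{sd/B}{2\alpha_\tau}\cdot\exp\Big(s\,\frac{1}{\lambda B}L_0\, d\, G(0)\Big)\cdot\log\Bigg(\frac{J_\lambda(\eta_0) + \frac{B}{sd}H(\eta_0|\tau)}{\Delta J^*_\lambda - \frac{B}{s}\big(1 + d^{-1}\log C' + \log s\big)}\Bigg).
\end{align*}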
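Noting that
\[
\log\frac{s\Delta J^*_\lambda}{4B} = \log s - \log\frac{4B}{\Delta J^*_\lambda} \le \frac{s\Delta J^*_\lambda}{4B} - 1
\]
so
\[
\frac{B}{s}\big(1 + d^{-1}\log C' + \log s\big) \le \frac{B}{s}\Big(d^{-1}\log C' + \log\frac{4B}{\Delta J^*_\lambda} + \frac{s\Delta J^*_\lambda}{4B}\Big) = \frac{B}{s}\Big(d^{-1}\log C' + \log\frac{4B}{\Delta J^*_\lambda}\Big) + \frac{\Delta J^*_\lambda}{4},
\]
choose henceforth $s = \max\Big\{1, \frac{4B}{\Delta J^*_\lambda}\Big(d^{-1}\log C' + \log\frac{4B}{\Delta J^*_\lambda}\Big)\Big\}$, so that $\Delta J^*_\lambda - \frac{B}{s}\big(1 + d^{-1}\log C' + \log s\big) \ge \frac{\Delta J^*_\lambda}{2}$.
To simplify the final statement, we make the assumption that $\Delta$ is small enough so that $1 \le \frac{4B}{\Delta J^*_\lambda}\Big(d^{-1}\log C' + \log\frac{4B}{\Delta J^*_\lambda}\Big)$.
More explicitly, since we were careful to choose $C'\ge 1$,
\begin{align*}
1 \le \frac{4B}{\Delta J^*_\lambda}\Big(d^{-1}\log C' + \log\frac{4B}{\Delta J^*_\lambda}\Big)
&\iff \frac{\Delta J^*_\lambda}{4B} + \log\frac{\Delta J^*_\lambda}{4B} \le d^{-1}\log C' \\
&\impliedby \frac{\Delta J^*_\lambda}{4B}\le 1 \text{ and } \log\frac{\Delta J^*_\lambda}{4B}\le -1 \\
&\iff \frac{\Delta J^*_\lambda}{4B} \le \min\{1, e^{-1}\} = e^{-1} \\
&\iff \Delta \le \frac{4Be^{-1}}{J^*_\lambda} = \frac{4e^{-1}}{J^*_\lambda}\cdot\sqrt{2L_0G(0)}\Big(\frac{L_1}{\lambda^2}\sqrt{2L_0G(0)} + \frac{B_1}{\lambda}\Big) \\
&\impliedby \Delta \le \frac{4e^{-1}}{J^*_\lambda}\cdot\frac{2L_0L_1G(0)}{\lambda^2}
\impliedby \Delta \le \frac{1}{J^*_\lambda}\cdot\frac{2L_0L_1G(0)}{\lambda^2}.
\end{align*}
Then $s = \frac{4B}{\Delta J^*_\lambda}\Big(d^{-1}\log C' + \log\frac{4B}{\Delta J^*_\lambda}\Big)$, $\beta = \frac{4d}{\Delta J^*_\lambda}\Big(d^{-1}\log C' + \log\frac{4B}{\Delta J^*_\lambda}\Big) \ge \frac{4}{\Delta J^*_\lambda}\log C'$, and
\begin{align*}
T_s &\le \frac{\beta}{2\alpha_\tau}\cdot\exp\Big(\frac{1}{\lambda}L_0\beta G(0)\Big)\cdot\log\Big(\frac{J_\lambda(\eta_0) + \beta^{-1}H(\eta_0|\tau)}{\Delta J^*_\lambda/2}\Big) \\
&\le \frac{2d}{\alpha_\tau\Delta J^*_\lambda}\log\Big(\frac{4C'^{1/d}B}{\Delta J^*_\lambda}\Big)\cdot\exp\Big(\frac{4dL_0G(0)}{\lambda\Delta J^*_\lambda}\log\Big(\frac{4C'^{1/d}B}{\Delta J^*_\lambda}\Big)\Big)\cdot\log\Big(\frac{2J_\lambda(\eta_0)}{\Delta J^*_\lambda} + \frac{H(\eta_0|\tau)}{2\log C'}\Big) =: T_\Delta.
\end{align*}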
Hence the time-complexity upper bound of T ∆ for reaching (1 + ∆)-multiplicative accuracy.
Algorithm 1 Annealing of the MFLD.
Require: Functional $J:\mathcal{P}(\mathcal{W})\to\mathbb{R}$. Initialization $\eta_0$, $\beta_0>0$. Schedule $K$, $(T_k)_{k=0}^K$.
1: $\eta^0_0 = \eta_0$
2: for $k = 0,\dots,K$ do
3:   $\beta_k = 2^k\beta_0$
4:   Run the MFLD with $\beta_k$ initialized from $\eta^k_0$ up to $T_k$: $\partial_t\eta^k_t = \operatorname{div}(\eta^k_t\nabla J'[\eta^k_t]) + \frac{1}{\beta_k}\Delta\eta^k_t$.
5:   $\eta^{k+1}_0 = \eta^k_{T_k}$
6: end for
7: return $\eta^K_{T_K}$.
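The loop structure of Algorithm 1 can be sketched with particles. The following is only an illustration under assumptions not made in the text: the MFLD of step 4 is replaced by an Euler-Maruyama particle discretization, $\mathcal{W}$ is the circle parameterized by an angle, and `grad_first_variation` is a user-supplied (hypothetical) callable returning $\nabla J'[\eta]$ at each particle for the empirical measure $\eta$.

```python
import numpy as np

def annealed_mfld(grad_first_variation, theta0, beta0, T, dt, rng):
    """Particle sketch of Algorithm 1 on the circle (angles in [0, 2*pi)).

    T[k] is the duration of round k; the inverse temperature doubles each round.
    """
    theta = np.array(theta0, dtype=float)
    betas = []
    for k, T_k in enumerate(T):
        beta_k = (2.0 ** k) * beta0                      # step 3: beta_k = 2^k beta_0
        betas.append(beta_k)
        for _ in range(int(np.ceil(T_k / dt))):          # step 4: run the dynamics up to T_k
            drift = -grad_first_variation(theta)
            noise = np.sqrt(2.0 * dt / beta_k) * rng.normal(size=theta.shape)
            theta = (theta + dt * drift + noise) % (2.0 * np.pi)
        # step 5: the next round is warm-started from the current particle cloud
    return theta, betas
```

For instance, with the linear toy objective $J(\eta)=\int(1-\cos)\,d\eta$ one has $\nabla J'[\eta](\theta)=\sin\theta$, so `annealed_mfld(np.sin, ...)` should concentrate the particles near angle $0$ as $\beta_k$ grows.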

Section: E.2 General annealing procedure and its convergence guarantee
The following theorem builds upon and generalizes the idea of [SWON23,Sec. 4.1] to objective functionals J that have a positive optimal value. It ensures fast convergence to a fixed multiplicative accuracy.
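Theorem E.3. Let $\mathcal{W}$ be a $d$-dimensional compact Riemannian manifold, so in particular the uniform measure $\tau$ over $\mathcal{W}$ satisfies $\alpha_\tau$-LSI for some $\alpha_\tau>0$. Let $J:\mathcal{P}(\mathcal{W})\to\mathbb{R}_+$ be convex, suppose that $J^* := \min J > 0$ and that there exists a minimizer $\eta^*$. Suppose that there exist constants $\kappa_1, C_L, A > 0$ such that
1. $\|J'[\eta]\|_\infty \le \kappa_1 J(\eta)$ for all $\eta\in\mathcal{P}(\mathcal{W})$.
2. $J(\eta) - J(\eta^*) \le C_L W_\infty(\eta,\eta^*)$ for all $\eta\in\mathcal{P}(\mathcal{W})$ such that $W_1(\eta,\eta^*)\le A$.
Fix $0<\delta\le\frac{C_L\min\{1,A\}}{J^*}$. Let $\eta^k_t$ be the iterates of the annealing procedure of Algorithm 1 with initialization $\beta_0 = d$ and with the schedule $K = \lceil\log_2(1/(\delta J^*))\rceil$ and
\[
T_k = 2^{k-1}d\,\log\big(2^kJ_{\beta_0}(\eta_0)\big)\cdot\alpha_\tau^{-1}\exp\Big(2\kappa_1 d\Big(\delta^{-1} + \log\frac{C_LC^{1/d}}{\delta J^*} + 2 + \frac{J_{\beta_0}(\eta_0)}{2}\Big)\Big) \tag{E.1}
\]
where $C := \Big(\inf_{w\in\mathcal{W}}\ \inf_{0<\epsilon\le 1}\ \epsilon^{-d}\cdot\tau(\{w';\ \operatorname{dist}_{\mathcal{W}}(w,w')\le\epsilon\})\Big)^{-1}$. Then
\[
J(\eta^K_{T_K}) \le J^*\Big(1 + 3\delta + 2\delta\log\frac{C_LC^{1/d}}{\delta J^*}\Big),
\]
and the total time-complexity is given by
\[
\sum_{k=0}^K T_k \le \frac{d}{\delta J^*}\log\Big(\frac{J_{\beta_0}(\eta_0)}{\delta J^*}\Big)\cdot\alpha_\tau^{-1}\exp\Big(2\kappa_1 d\Big(\delta^{-1} + \log\frac{C_LC^{1/d}}{\delta J^*} + 2 + \frac{J_{\beta_0}(\eta_0)}{2}\Big)\Big).
\]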
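To make the schedule concrete, the following sketch computes $K$ and the durations $T_k$, reading (E.1) as $T_k = 2^{k-1}d\log(2^kJ_{\beta_0}(\eta_0))\,\alpha_\tau^{-1}\exp\big(2\kappa_1d(\delta^{-1}+\log\frac{C_LC^{1/d}}{\delta J^*}+2+\frac12J_{\beta_0}(\eta_0))\big)$. The function name, argument names and numerical values are illustrative assumptions; `J0` stands for $J_{\beta_0}(\eta_0)$.

```python
import math

def annealing_schedule(d, delta, J_star, J0, kappa1, C_L, C, alpha_tau):
    """Number of rounds K and durations T_0, ..., T_K following (E.1)."""
    K = math.ceil(math.log2(1.0 / (delta * J_star)))
    # Exponent controlling the inverse of the uniform LSI lower bound.
    exponent = 2.0 * kappa1 * d * (
        1.0 / delta + math.log(C_L * C ** (1.0 / d) / (delta * J_star)) + 2.0 + J0 / 2.0
    )
    common = math.exp(exponent) / alpha_tau
    T = [2.0 ** (k - 1) * d * math.log(2.0 ** k * J0) * common for k in range(K + 1)]
    return K, T

K, T = annealing_schedule(d=2, delta=0.5, J_star=0.01, J0=3.0,
                          kappa1=0.1, C_L=1.0, C=1.0, alpha_tau=1.0)
```

The factor `common` is shared by all rounds, so the schedule grows essentially geometrically in $k$, while the dependence on $\kappa_1$ and $\delta^{-1}$ is exponential.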
Let us discuss the assumptions of Thm. E.3 and possible generalizations.
• Note that condition 2. of the theorem holds as soon as $J'[\eta]:\mathcal{W}\to\mathbb{R}$ is $C_L$-Lipschitz for all $\eta\in\mathcal{P}(\mathcal{W})$, as shown in Lem. D.8, since $W_1\le W_2\le W_\infty$.
• The annealing procedure and its convergence guarantee can be generalized to a non-compact manifold $\mathcal{W}$ by modifying MFLD to include a confining potential term, as discussed in Sec. A.2.
• Condition 1. of the theorem actually holds for any $J$ such that $\sup_{\eta,w,w'}|J''[\eta](w,w')|\le L<\infty$ and $J^*>0$, with the constant $\kappa_1 = \sqrt{2L/J^*}$. Indeed, one can then show similarly to Lem. D.3 that
\[
\|J'[\eta]\|_\infty^2 \le 2L\,(J(\eta) - J^*) \le 2LJ(\eta) \le 2L\frac{J(\eta)^2}{J^*}.
\]
However, when plugging in $\kappa_1 = \sqrt{2L/J^*}$ into the bounds of the theorem, one obtains a less favorable dependency of the total time-complexity in $J^*$. In particular, note that the total time-complexity guaranteed by the theorem scales exponentially in $\kappa_1$ and polynomially in $1/J^*$.
• Condition 1. of the theorem enters the proof by guaranteeing a local LSI constant of $J + \beta_t^{-1}H$ at $\eta_t$ of $\alpha_{\eta_t} = \mathrm{cst}\cdot e^{-\kappa_1\beta_tJ(\eta_t)}$. One could similarly formulate an annealing procedure, and state convergence guarantees, tailored to objectives $J$ that satisfy different criteria for LSI, such as the Bakry-Emery curvature-dimension criterion.
The remainder of this subsection is dedicated to proving Thm. E.3.
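Proof of Thm. E.3. Fix any $0<\delta\le\frac{C_L\min\{1,A\}}{J^*}$. Let, for any $\beta>0$, $J_\beta = J + \beta^{-1}H(\cdot|\tau)$.
By condition 1. of the theorem and the Holley-Stroock bounded perturbation argument, for any $t, k$, the proximal Gibbs measure $\hat\eta^k_t \propto e^{-\beta_kJ'[\eta^k_t]}\tau$ satisfies LSI with the constant
\[
\alpha_\tau\exp\big(-\beta_k\kappa_1J(\eta^k_t)\big) \ge \inf_{t'\ge 0}\alpha_\tau\exp\big(-\beta_k\kappa_1J(\eta^k_{t'})\big) =: \alpha(k).
\]
That is, for any $k$, $J_{\beta_k}$ satisfies $\alpha(k)$-LSI at $\eta^k_t$ for all $t\ge 0$. (To see that $\alpha(k)>0$, note that for any $k, t$, $J(\eta^k_t)\le J_{\beta_k}(\eta^k_t)\le J_{\beta_k}(\eta^k_0)$, since $H(\cdot|\tau)$ is non-negative and $(\eta^k_t)_t$ is a Wasserstein gradient flow of $J_{\beta_k}$, and so $\alpha(k) = \inf_{t\ge 0}\alpha_\tau\exp\big(-\beta_k\kappa_1J(\eta^k_t)\big) \ge \alpha_\tau\exp\big(-\beta_k\kappa_1J_{\beta_k}(\eta^k_0)\big) > 0$; but we will not make use of this rough bound in the sequel.) Now let
\[
T_k = \frac{\beta_k}{2\underline{\alpha}(k)}\log\Big(\frac{\beta_k}{d}c_k\Big)
\]
for some $\underline{\alpha}(k)\le\alpha(k)$ and $c_k\ge J_{\beta_k}(\eta^k_0) - \min J_{\beta_k}$ to be chosen.
Then by Thm. 2.1 applied to $J_{\beta_k}$, we obtain
\begin{align*}
J_{\beta_k}(\eta^k_{T_k}) &\le \min J_{\beta_k} + \exp\big(-2\beta_k^{-1}\alpha(k)T_k\big)\cdot\big(J_{\beta_k}(\eta^k_0) - \min J_{\beta_k}\big) \\
&\le \min J_{\beta_k} + \Big(\frac{\beta_k}{d}\big(J_{\beta_k}(\eta^k_0) - \min J_{\beta_k}\big)\Big)^{-1}\cdot\big(J_{\beta_k}(\eta^k_0) - \min J_{\beta_k}\big) = \min J_{\beta_k} + \frac{d}{\beta_k}.
\end{align*}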
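Further, by Lem. E.1,
\begin{align*}
J_{\beta_k}(\eta^k_{T_k}) &\le J^* + \inf_{0<\epsilon\le\min\{1,A\}}\Big(C_L\epsilon + \frac{d}{\beta_k}\log\frac{1}{\epsilon}\Big) + \frac{\log C}{\beta_k} + \frac{d}{\beta_k} \\
&\le J^*(1+\delta) + \frac{d}{\beta_k}\log\frac{C_L}{\delta J^*} + \frac{d+\log C}{\beta_k}, \tag{E.2}
\end{align*}
where the last inequality follows by choosing $\epsilon = \frac{\delta J^*}{C_L}\le\min\{1,A\}$ since $\delta\le\frac{C_L\min\{1,A\}}{J^*}$.
Then, for all $k\ge 1$ and $t\ge 0$,
\[
\beta_kJ(\eta^k_t) \le \beta_kJ_{\beta_k}(\eta^k_t) \le \beta_kJ_{\beta_k}(\eta^k_0) = \beta_kJ_{\beta_k}(\eta^{k-1}_{T_{k-1}}) \le \beta_kJ_{\beta_{k-1}}(\eta^{k-1}_{T_{k-1}}) = 2\beta_{k-1}J_{\beta_{k-1}}(\eta^{k-1}_{T_{k-1}}),
\]
where we used successively that $J_{\beta_k} - J = \beta_k^{-1}H(\cdot|\tau)\ge 0$, that $(\eta^k_t)_t$ is a Wasserstein gradient flow for $J_{\beta_k}$, that $J_{\beta_{k-1}} - J_{\beta_k} = (\beta_{k-1}^{-1} - \beta_k^{-1})H(\cdot|\tau)\ge 0$ since $(\beta_k)_k$ is increasing, and that by definition $\beta_k = 2^k\beta_0$. So by (E.2),
\begin{align*}
\beta_kJ(\eta^k_t) \le 2\beta_{k-1}J_{\beta_{k-1}}(\eta^{k-1}_{T_{k-1}})
&\le 2\beta_{k-1}J^*(1+\delta) + 2d\log\frac{C_L}{\delta J^*} + 2d + 2\log C \\
&\le 2\frac{d}{\delta}(1+\delta) + 2d\log\frac{C_L}{\delta J^*} + 2d + 2\log C
= 2d\Big(\delta^{-1} + \log\frac{C_L}{\delta J^*} + 2 + \frac{\log C}{d}\Big)
\end{align*}
since our choice of $\beta_0 = d$ and $K = \lceil\log_2(1/(\delta J^*))\rceil$ ensures that $\beta_{k-1}\le\beta_K = 2^K\beta_0\le\frac{d}{\delta J^*}$.
For $k=0$ and all $t\ge 0$, we have more simply $\beta_0J(\eta^0_t)\le\beta_0J_{\beta_0}(\eta^0_t)\le\beta_0J_{\beta_0}(\eta_0) = dJ_{\beta_0}(\eta_0)$. As a result, for all $k\ge 0$ we have
\[
\forall t\ge 0,\quad \beta_kJ(\eta^k_t) \le 2d\Big(\delta^{-1} + \log\frac{C_L}{\delta J^*} + 2 + \frac{\log C}{d} + \frac12J_{\beta_0}(\eta_0)\Big)
\]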
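and so
\[
\alpha(k) = \inf_{t\ge 0}\alpha_\tau\exp\big(-\kappa_1\beta_kJ(\eta^k_t)\big) \ge \alpha_\tau\exp\Big(-2\kappa_1d\Big(\delta^{-1} + \log\frac{C_L}{\delta J^*} + 2 + \frac{\log C}{d} + \frac12J_{\beta_0}(\eta_0)\Big)\Big) =: \underline{\alpha}(k).
\]
Moreover, we can choose $c_k$ as follows: $J_{\beta_k}(\eta^k_0) = J_{\beta_k}(\eta^{k-1}_{T_{k-1}}) \le J_{\beta_{k-1}}(\eta^{k-1}_{T_{k-1}}) \le J_{\beta_{k-1}}(\eta^{k-1}_0) \le \dots \le J_{\beta_0}(\eta_0)$ by induction, so $J_{\beta_k}(\eta^k_0) - \min J_{\beta_k} \le J_{\beta_0}(\eta_0) =: c_k$. Therefore, more explicitly,
\begin{align*}
T_k = \frac{\beta_k}{2\underline{\alpha}(k)}\log\Big(\frac{\beta_k}{d}c_k\Big)
&= \frac{\beta_k}{2}\log\Big(\frac{\beta_k}{d}J_{\beta_0}(\eta_0)\Big)\cdot\alpha_\tau^{-1}\exp\Big(2\kappa_1d\Big(\delta^{-1} + \log\frac{C_L}{\delta J^*} + 2 + \frac{\log C}{d} + \frac12J_{\beta_0}(\eta_0)\Big)\Big) \\
&= 2^{k-1}d\cdot\log\big(2^kJ_{\beta_0}(\eta_0)\big)\cdot\alpha_\tau^{-1}\exp\Big(2\kappa_1d\Big(\delta^{-1} + \log\frac{C_L}{\delta J^*} + 2 + \frac{\log C}{d} + \frac12J_{\beta_0}(\eta_0)\Big)\Big)
\end{align*}
since $\beta_k = 2^k\beta_0 = 2^kd$. Note that
\begin{align*}
\sum_{k=0}^K2^{k-1}\log\big(2^kJ_{\beta_0}(\eta_0)\big)
&= \sum_{k=0}^K2^k\,\frac{\log J_{\beta_0}(\eta_0)}{2} + \sum_{k=0}^Kk2^{k-1}\log(2) \\
&= (2^{K+1}-1)\frac{\log J_{\beta_0}(\eta_0)}{2} + \log(2)\big((K-1)2^K + 1\big) \\
&\le 2^K\log J_{\beta_0}(\eta_0) + \log(2)K2^K \\
&\le \frac{1}{\delta J^*}\log J_{\beta_0}(\eta_0) + \frac{1}{\delta J^*}\log\frac{1}{\delta J^*} = \frac{1}{\delta J^*}\log\frac{J_{\beta_0}(\eta_0)}{\delta J^*}
\end{align*}
since $K = \lceil\log_2(1/(\delta J^*))\rceil$, hence the announced bound on the total time-complexity $\sum_{k=0}^KT_k$. Finally, at round $K = \lceil\log_2(1/(\delta J^*))\rceil$, we have $\beta_K = 2^K\beta_0 = 2^Kd\in\big[\frac12\frac{d}{\delta J^*}, \frac{d}{\delta J^*}\big]$, so by (E.2),
\[
J(\eta^K_{T_K}) \le J_{\beta_K}(\eta^K_{T_K}) \le J^*(1+\delta) + \frac{d}{\beta_K}\log\frac{C_L}{\delta J^*} + \frac{d+\log C}{\beta_K} \le J^*\Big(1 + 3\delta + 2\delta\frac{\log C}{d} + 2\delta\log\frac{C_L}{\delta J^*}\Big),
\]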
which completes the proof.

Section: E.3 Proof of Thm. 4.2
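We state a slightly more precise version of Thm. 4.2 below, and prove it as a corollary of the more general Thm. E.3. Then Thm. 4.2 follows by choosing $\delta = \Theta\big(\frac{\Delta}{\log(B/(\Delta J^*_\lambda))}\big)$, gathering the constants appearing in the bounds, and noting that $J_{\lambda,\beta_0}(\eta_0) \le J_\lambda(\eta_0) + dH(\eta_0|\tau) \le G(0) + dH(\eta_0) + d\log\operatorname{vol}(\mathcal{W})$.
Theorem E.4. Under Assumption 1, there exist constants $B = \operatorname{poly}(L_i, B_i, G(0), \lambda^{-1})$ and $C$ dependent only on $\mathcal{W}$ such that the following holds. For any $\delta\le\frac{B}{J^*_\lambda}$, MFLD-Bilevel with the temperature schedule $(\beta_t)_{t\ge 0}$ defined by $\forall k\le K,\ \forall t\in[t_k,t_{k+1}],\ \beta_t = 2^kd$, where $t_0 = 0$ and $K = \lceil\log_2(1/(\delta J^*))\rceil$ and
\[
t_{k+1} - t_k = 2^{k-1}d\,\log\big(2^kJ_{\lambda,\beta_0}(\eta_0)\big)\cdot\alpha_\tau^{-1}\exp\Big(\frac{2L_0d}{\lambda}\Big(\delta^{-1} + \log\frac{BC^{1/d}}{\delta J^*_\lambda} + 2 + \frac{J_{\lambda,\beta_0}(\eta_0)}{2}\Big)\Big),
\]
achieves $(1+\Delta)$-multiplicative accuracy, where $\Delta = 3\delta + 2\delta\log\frac{BC^{1/d}}{\delta J^*_\lambda}$, with time-complexity
\[
T_\Delta \le t_{K+1} \le \frac{d}{\delta J^*_\lambda}\log\Big(\frac{J_{\lambda,\beta_0}(\eta_0)}{\delta J^*_\lambda}\Big)\cdot\alpha_\tau^{-1}\exp\Big(\frac{2L_0d}{\lambda}\Big(\delta^{-1} + \log\frac{BC^{1/d}}{\delta J^*_\lambda} + 2 + \frac{J_{\lambda,\beta_0}(\eta_0)}{2}\Big)\Big).
\]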
Proof of Thm. 4.2. Let us show that the conditions of Thm. E.3 are satisfied, under Assumption 1, for $J = J_\lambda$. $J_\lambda$ is convex and non-negative, and it is implied throughout Sec. 4.1 that $\inf J_\lambda > 0$, for the notion of convergence to a fixed multiplicative accuracy to apply (Def. 4.1). The existence of a minimizer $\eta^*$ is ensured by the weak continuity of $J_\lambda$, by a similar argument as the proof of (P0) in Sec. D.3. We have condition 1. with $\kappa_1 = \frac{L_0}{\lambda}$, i.e. $\|J'_\lambda[\eta]\|_\infty \le \frac{L_0}{\lambda}J_\lambda(\eta)$, by the first part of Lem. D.4. We also have condition 2. with $A = \infty$ and $C_L = B := \sqrt{2L_0G(0)}\cdot\big(\frac{L_1}{\lambda^2}\sqrt{2L_0G(0)} + \frac{B_1}{\lambda}\big)$, as shown in Lem. D.5, since $W_1\le W_2\le W_\infty$.
Note that annealed MFLD-Bilevel with the announced temperature annealing schedule $(\beta_t)_t$ precisely corresponds to Algorithm 1 with the schedule (E.1) applied to $J = J_\lambda$. So the announced time-complexity bound follows directly from the application of Thm. E.3.

Section: F Details for Sec. 5 (estimates of the local LSI constant)
We begin by presenting the proof of Prop. 5.1, which states that bounding the LSI constant of η λ,β leads to a local convergence rate.
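Proof of Prop. 5.1. For any $\eta\in\mathcal{P}(\mathcal{W})$, we denote $\hat\eta(dw) = e^{-\beta J'_\lambda[\eta](w)}\tau(dw)/Z_\eta$ where $Z_\eta = \int e^{-\beta J'_\lambda[\eta]}d\tau$. First note that for any $\eta,\eta'\in\mathcal{P}(\mathcal{W})$,
\begin{align*}
\Big|\log\frac{d\hat\eta}{d\hat\eta'}(w) + (\log Z_\eta - \log Z_{\eta'})\Big|
&= \beta\,|J'_\lambda[\eta](w) - J'_\lambda[\eta'](w)| = \beta\frac{\lambda}{2}\big|f_\eta(w)^2 - f_{\eta'}(w)^2\big| \\
&\le \beta\frac{\lambda}{2}\big(|f_\eta| + |f_{\eta'}|\big)(w)\cdot|f_\eta - f_{\eta'}|(w) \\
&\le \beta\frac{\lambda}{2}\cdot 2\,\frac{1}{\lambda}\sqrt{2L_0G(0)}\cdot H\, W_2(\eta,\eta') =: \widetilde{H}\, W_2(\eta,\eta')
\end{align*}
by Lem. D.4 and Lem. D.7, where $\widetilde{H}$ is a constant dependent only on $\beta$, $\lambda^{-1}, G(0), L_0, L_1, B_1, L_2$. Now suppose that $\eta_{\lambda,\beta} = \arg\min J_{\lambda,\beta} = \hat\eta_{\lambda,\beta}$ satisfies $\alpha^*$-LSI. Let $\varepsilon>0$ and $\eta_0$ in the $\delta$-sublevel set of $J_{\lambda,\beta}$, i.e., $\eta_0\in S_\delta := J_{\lambda,\beta}^{-1}\big((-\infty, \inf J_{\lambda,\beta} + \delta]\big)$, for some $\delta>0$ to be chosen. Denote by $(\eta_t)_t$ the MFLD trajectory for $J_{\lambda,\beta}$ initialized at $\eta_0$. Note that $S_\delta$ is stable by MFLD since $J_{\lambda,\beta}(\eta_t)$ decreases with $t$. So it suffices to show that $J_{\lambda,\beta}$ satisfies $(\alpha^*-\varepsilon)$-LSI uniformly over $S_\delta$.
Choose any $\eta\in S_\delta$, i.e., such that $J_{\lambda,\beta}(\eta) - \inf J_{\lambda,\beta}\le\delta$. In particular by Thm. 2.1, it holds
\[
\beta^{-1}H(\eta|\eta_{\lambda,\beta}) \le J_{\lambda,\beta}(\eta) - \inf J_{\lambda,\beta} \le \delta.
\]
Furthermore, since $\eta_{\lambda,\beta}$ satisfies LSI with constant $\alpha^*$ then it also satisfies the following Talagrand inequality, as shown in [OV00]:
\[
\forall\eta',\quad W_2(\eta',\eta_{\lambda,\beta}) \le \sqrt{\frac{2}{\alpha^*}H(\eta'|\eta_{\lambda,\beta})}.
\]
Then by the inequality noted above, we have
\[
\Big|\log\frac{d\hat\eta}{d\eta_{\lambda,\beta}}(w) + c\Big| \le \widetilde{H}\, W_2(\eta,\eta_{\lambda,\beta}) \le \widetilde{H}\sqrt{\frac{2}{\alpha^*}H(\eta|\eta_{\lambda,\beta})} \le \widetilde{H}\sqrt{\frac{2}{\alpha^*}\cdot\beta\delta} =: M\sqrt{\delta}
\]
for some $c\in\mathbb{R}$, and so by the Holley-Stroock bounded perturbation argument, $\hat\eta$ satisfies LSI with constant $\alpha^*e^{-M\sqrt{\delta}}\ge\alpha^*-\varepsilon$ for $\delta$ small enough.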
Section: F.1 Preliminary estimates for J λ under Assumption 2
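Throughout the remainder of this appendix, in the context of Assumption 2, we will use the following notations:
• the Hilbert space $\mathcal{H} = L^2_\rho(\mathbb{R}^{d+1})$ with the inner product $\langle f,g\rangle_{\mathcal{H}} = \mathbb{E}_{x\sim\rho}f(x)g(x)$,
• the feature map $\phi:\mathcal{W}\to\mathcal{H}$ given by $\phi(w)(x) = \varphi(\langle w,x\rangle)$,
• the symmetric positive-semi-definite operator in $\mathcal{H}$: $K_\eta = \int\phi(w)\phi(w)^*d\eta(w)$, where $*$ denotes adjoint in $\mathcal{H}$.
• For any $h\in\mathcal{H}$, we denote by $\langle h,\nabla\phi(w)\rangle_{\mathcal{H}}$ (resp. $\langle h,\nabla^2\phi(w)\rangle_{\mathcal{H}}$) the gradient (resp. Hessian) at $w$ of $w\mapsto\langle h,\phi(w)\rangle_{\mathcal{H}}$.
The usefulness of these notations is justified by Prop. F.1 below, which gives a simplified expression for $J_\lambda$ and $J'_\lambda$.
Proposition F.1. Under Assumption 2, letting the Hilbert space $\mathcal{H} = L^2_\rho(\mathbb{R}^{d+1})$ and the feature map $\phi:\mathcal{W}\to\mathcal{H}$ given by $\phi(w)(x) = \varphi(\langle w,x\rangle)$, we have
\[
J_\lambda(\eta) = \frac{\lambda}{2}\big\langle y, (K_\eta + \lambda\,\mathrm{id})^{-1}y\big\rangle_{\mathcal{H}}, \qquad
J'_\lambda[\eta](w) = -\frac{\lambda}{2}\big\langle\phi(w), (K_\eta + \lambda\,\mathrm{id})^{-1}y\big\rangle_{\mathcal{H}}^2,
\]
with $K_\eta = \int\phi(w)\phi(w)^*d\eta(w)$, where $*$ denotes adjoint in $\mathcal{H}$. More explicitly, $K_\eta$ is the integral operator of the kernel $k_\eta(x,x') = \int\varphi(\langle w,x\rangle)\varphi(\langle w,x'\rangle)d\eta(w)$ with respect to the distribution $x\sim\rho$, i.e., $\forall h\in\mathcal{H} = L^2_\rho(\mathbb{R}^{d+1}),\ (K_\eta h)(x) = \mathbb{E}_{x'\sim\rho}[k_\eta(x,x')h(x')]$ in $L^2_\rho$.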
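Proof. Under Assumption 2 we have
\[
G(\nu) = \frac12\mathbb{E}_{x\sim\rho}\Big[\Big(\int_{\mathcal{W}}\varphi(\langle w,x\rangle)d\nu(w) - y(x)\Big)^2\Big] = \frac12\Big\|\int_{\mathcal{W}}\phi(w)d\nu(w) - y\Big\|_{\mathcal{H}}^2,
\]
so the optimization problem (3.3) defining $J_\lambda(\eta)$, for a fixed $\eta$, writes
\[
\min_{f\in L^2_\eta(\mathcal{W})}\frac12\Big\|\int_{\mathcal{W}}\phi(w)f(w)d\eta(w) - y\Big\|_{\mathcal{H}}^2 + \frac{\lambda}{2}\int_{\mathcal{W}}|f|^2(w)d\eta(w).
\]
This problem is strictly convex thanks to the term in $\lambda$, and the first-order condition is $\forall w,\ \big\langle\int\phi f\,d\eta - y, \phi(w)\big\rangle_{\mathcal{H}}\,\eta(dw) + \lambda f(w)\,\eta(dw) = 0$. So the unique minimum $f_\eta$ is a solution of the fixed point equation $f(w) = -\frac{1}{\lambda}\big\langle\int\phi f\,d\eta - y, \phi(w)\big\rangle_{\mathcal{H}}$ in $L^2_\eta(\mathcal{W})$. In particular, denoting $\hat h_\eta = -\frac{1}{\lambda}\big(\int\phi f_\eta\,d\eta - y\big)$, then $f_\eta(w) = \langle\hat h_\eta,\phi(w)\rangle_{\mathcal{H}}$ and, integrating against $\phi\,\eta$,
\begin{align*}
\int_{\mathcal{W}}f_\eta(w)\phi(w)d\eta(w) = \int_{\mathcal{W}}\phi(w)\big(\phi(w)^*\hat h_\eta\big)d\eta(w)
&\iff -\lambda\hat h_\eta + y = K_\eta\hat h_\eta \\
&\iff (K_\eta + \lambda\,\mathrm{id})\hat h_\eta = y \\
&\iff \hat h_\eta = (K_\eta + \lambda\,\mathrm{id})^{-1}y,
\end{align*}
where $a^*b = \langle a,b\rangle_{\mathcal{H}}$ and $K_\eta = \int_{\mathcal{W}}\phi(w)\phi(w)^*d\eta(w)$. So the optimal value $J_\lambda(\eta)$ is
\begin{align*}
J_\lambda(\eta) &= \frac12\Big\|\int_{\mathcal{W}}\phi(w)f_\eta(w)d\eta(w) - y\Big\|_{\mathcal{H}}^2 + \frac{\lambda}{2}\int_{\mathcal{W}}|f_\eta|^2(w)d\eta(w) \tag{F.1} \\
&= \frac12\big\|\lambda\hat h_\eta\big\|_{\mathcal{H}}^2 + \frac{\lambda}{2}\int_{\mathcal{W}}\hat h_\eta^*\phi(w)\,\phi(w)^*\hat h_\eta\,d\eta(w) \\
&= \frac12\big\langle\lambda\hat h_\eta, \lambda\hat h_\eta\big\rangle_{\mathcal{H}} + \frac{\lambda}{2}\big\langle\hat h_\eta, K_\eta\hat h_\eta\big\rangle_{\mathcal{H}} \\
&= \frac12\big\langle\lambda\hat h_\eta, \lambda\hat h_\eta + K_\eta\hat h_\eta\big\rangle_{\mathcal{H}} = \frac12\big\langle\lambda\hat h_\eta, y\big\rangle_{\mathcal{H}} = \frac{\lambda}{2}\big\langle y, (K_\eta + \lambda\,\mathrm{id})^{-1}y\big\rangle_{\mathcal{H}}.
\end{align*}
Further, by applying the envelope theorem on (F.1) (and reasoning similarly to the proof of Prop. D.2 to deal with $w\notin\operatorname{supp}(\eta)$, by extending $f_\eta\in L^2_\eta(\mathcal{W})$ into a function $\mathcal{W}\to\mathbb{R}$), we then have
\begin{align*}
\forall w\in\mathcal{W},\quad J'_\lambda[\eta](w) &= \Big\langle\int\phi f_\eta\,d\eta - y, \phi(w)f_\eta(w)\Big\rangle_{\mathcal{H}} + \frac{\lambda}{2}|f_\eta|^2(w) \\
&= f_\eta(w)\big\langle-\lambda\hat h_\eta, \phi(w)\big\rangle_{\mathcal{H}} + \frac{\lambda}{2}|f_\eta|^2(w) \\
&= -\lambda|f_\eta|^2(w) + \frac{\lambda}{2}|f_\eta|^2(w) = -\frac{\lambda}{2}|f_\eta|^2(w) = -\frac{\lambda}{2}\big\langle\hat h_\eta,\phi(w)\big\rangle_{\mathcal{H}}^2.
\end{align*}
The characterization of $K_\eta$ as the integral operator in $L^2_\rho(\mathbb{R}^{d+1})$ of the kernel $k_\eta(x,x') = \int_{\mathcal{W}}\phi(w)(x)\,\phi(w)(x')d\eta(w)$ follows directly from the definition $K_\eta = \int_{\mathcal{W}}\phi(w)\phi(w)^*d\eta(w)$, since $\forall h\in\mathcal{H},\ K_\eta h = \int_{\mathcal{W}}\phi(w)\langle\phi(w),h\rangle_{\mathcal{H}}d\eta(w)$, and so
\[
(K_\eta h)(x) = \int_{\mathcal{W}}\phi(w)(x)\,\mathbb{E}_{x'\sim\rho}[\phi(w)(x')h(x')]\,d\eta(w) = \mathbb{E}_{x'\sim\rho}\Big[\int_{\mathcal{W}}\phi(w)(x)\phi(w)(x')\,h(x')\,d\eta(w)\Big] = \mathbb{E}_{x'\sim\rho}[k_\eta(x,x')h(x')].
\]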
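The closed form of Prop. F.1 is easy to check in a finite-dimensional surrogate. The sketch below assumes an empirical data distribution on $n$ points (so that $\langle f,g\rangle_{\mathcal{H}}$ becomes a mean over samples) and a discrete $\eta$ with positive weights; the matrix `Phi`, with `Phi[i, j]` standing for $\phi(w_j)(x_i)$, and the function names are hypothetical.

```python
import numpy as np

def bilevel_closed_form(Phi, y, eta, lam):
    """Return J_lambda(eta) = (lam/2) <y, (K_eta + lam id)^{-1} y>_H and f_eta."""
    n = y.size
    K = Phi @ (eta[:, None] * Phi.T) / n            # matrix of K_eta acting on R^n
    h = np.linalg.solve(K + lam * np.eye(n), y)     # h_eta = (K_eta + lam id)^{-1} y
    f = Phi.T @ h / n                               # f_eta(w_j) = <h_eta, phi(w_j)>_H
    return 0.5 * lam * np.mean(y * h), f

def bilevel_objective(Phi, y, eta, f, lam):
    """Inner objective 0.5*||sum_j phi_j f_j eta_j - y||_H^2 + 0.5*lam*sum_j eta_j f_j^2."""
    resid = Phi @ (eta * f) - y
    return 0.5 * np.mean(resid**2) + 0.5 * lam * np.sum(eta * f**2)

rng = np.random.default_rng(5)
Phi = np.tanh(rng.normal(size=(25, 4)))
y = rng.normal(size=25)
eta = rng.dirichlet(np.ones(4))
J, f = bilevel_closed_form(Phi, y, eta, lam=0.2)
```

Evaluating `bilevel_objective` at the returned `f` recovers `J`, and any perturbation of `f` can only increase the objective, since the inner problem is strictly convex.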
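We have the following Wasserstein Lipschitz-continuity properties for the bilevel objective functional $J_\lambda$.
Proposition F.2. Under Assumption 2, suppose furthermore that $\sup_w\|\nabla^i\phi(w)\|_{\mathcal{H}}\le B_i<\infty$ for $i\in\{0,1,2\}$. Then for any $w\in\mathcal{W}=\mathbb{S}^d$ and any $\eta,\eta'\in\mathcal{P}(\mathcal{W})$, it holds
\begin{align*}
|J_\lambda(\eta) - J_\lambda(\eta')| &\le \frac{B_0B_1}{\lambda}\|y\|_{\mathcal{H}}^2\cdot W_1(\eta,\eta'), \\
|J'_\lambda[\eta](w) - J'_\lambda[\eta'](w)| &\le \frac{2B_0^3B_1}{\lambda^2}\|y\|_{\mathcal{H}}^2\cdot W_1(\eta,\eta'), \\
\|\nabla J'_\lambda[\eta](w) - \nabla J'_\lambda[\eta'](w)\|_w &\le \frac{4B_0^2B_1^2}{\lambda^2}\|y\|_{\mathcal{H}}^2\cdot W_1(\eta,\eta'), \\
\big\|\nabla^2J'_\lambda[\eta](w) - \nabla^2J'_\lambda[\eta'](w)\big\|_{\mathrm{op},w} &\le \frac{4B_0B_1(B_0B_2 + B_1^2)}{\lambda^2}\|y\|_{\mathcal{H}}^2\cdot W_1(\eta,\eta').
\end{align*}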
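Proof. By Prop. F.1,
\[ J'_\lambda[\eta](w) = -\frac{\lambda}{2} \big\langle \phi(w), (K_\eta + \lambda\,\mathrm{id})^{-1} y \big\rangle_{\mathcal H}^2 \quad \text{where } K_\eta = \int_{\mathcal W} \phi(w'')\phi(w'')^*\, d\eta(w''), \]
so
\[ \nabla J'_\lambda[\eta](w) = -\lambda \big\langle \phi(w), (K_\eta + \lambda\,\mathrm{id})^{-1} y \big\rangle_{\mathcal H} \big\langle \nabla\phi(w), (K_\eta + \lambda\,\mathrm{id})^{-1} y \big\rangle_{\mathcal H} \tag{F.2} \]
\[ \|\nabla J'_\lambda[\eta]\|_w \le \lambda \|\phi(w)\|_{\mathcal H} \|\nabla\phi(w)\|_w \big\|(K_\eta + \lambda\,\mathrm{id})^{-1} y\big\|_{\mathcal H}^2 \le \lambda B_0 B_1 \|y\|_{\mathcal H}^2 \big\|(K_\eta + \lambda)^{-1}\big\|_{\mathrm{op}}^2 \le \frac{1}{\lambda} B_0 B_1 \|y\|_{\mathcal H}^2 \]
since $K_\eta$ is positive-semi-definite by definition and so
\[ \big\|(K_\eta + \lambda)^{-1}\big\|_{\mathrm{op}} = \sigma_{\max}\big((K_\eta + \lambda\,\mathrm{id})^{-1}\big) = \big[\sigma_{\min}(K_\eta + \lambda\,\mathrm{id})\big]^{-1} \le \lambda^{-1}. \]
So by applying Lem. D.8, this shows the first inequality.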
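Moreover, the first variation of $K_\eta$ at any $\eta$ is $w' \mapsto \phi(w')\phi(w')^*$, thus by the formula $\partial(X^{-1}) = -X^{-1}(\partial X)X^{-1}$ for the derivative of a matrix inverse,
\[ \frac{\delta}{\delta\eta(w')} (K_\eta + \lambda\,\mathrm{id})^{-1} = -(K_\eta + \lambda\,\mathrm{id})^{-1} \cdot \phi(w')\phi(w')^* \cdot (K_\eta + \lambda\,\mathrm{id})^{-1}, \]
and so, letting for concision $M = (K_\eta + \lambda\,\mathrm{id})^{-1}$,
\[ J''_\lambda[\eta](w, w') = -\lambda \big\langle \phi(w), (K_\eta + \lambda\,\mathrm{id})^{-1} y \big\rangle_{\mathcal H} \big\langle \phi(w), -(K_\eta + \lambda\,\mathrm{id})^{-1} \cdot \phi(w')\phi(w')^* \cdot (K_\eta + \lambda\,\mathrm{id})^{-1} y \big\rangle_{\mathcal H} \]
\[ = -\lambda \langle \phi(w), M y\rangle_{\mathcal H} \langle \phi(w), -M \cdot \phi(w')\phi(w')^* \cdot M y\rangle_{\mathcal H} = \lambda \langle \phi(w), M y\rangle_{\mathcal H} \langle \phi(w), M \phi(w')\rangle_{\mathcal H} \langle \phi(w'), M y\rangle_{\mathcal H}. \]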
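The resolvent derivative used above can be checked by finite differences. The following sketch assumes, purely for illustration, that $\mathcal H = \mathbb R^n$ with the standard inner product (so that $\phi(w')\phi(w')^*$ is an outer product); the function names are hypothetical.

```python
import numpy as np

def resolvent(K, lam):
    # M = (K + lam id)^{-1}
    return np.linalg.inv(K + lam * np.eye(K.shape[0]))

def resolvent_derivative(K, lam, phi):
    # Directional derivative of K -> (K + lam id)^{-1} in the direction phi phi^T,
    # i.e. -M (phi phi^T) M, as in the formula d(X^{-1}) = -X^{-1} (dX) X^{-1}.
    M = resolvent(K, lam)
    return -M @ np.outer(phi, phi) @ M

rng = np.random.default_rng(1)
n, lam, t = 6, 1.0, 1e-6
A = rng.standard_normal((n, n))
K = A @ A.T / n                        # a positive semi-definite stand-in for K_eta
phi = rng.standard_normal(n)
phi /= np.linalg.norm(phi)

finite_diff = (resolvent(K + t * np.outer(phi, phi), lam) - resolvent(K, lam)) / t
analytic = resolvent_derivative(K, lam, phi)
print(np.max(np.abs(finite_diff - analytic)))          # small, of order t
print(np.linalg.norm(resolvent(K, lam), 2) <= 1 / lam)  # ||M||_op <= 1/lam
```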
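As a result,
\[ \nabla_w J''_\lambda[\eta](w, w') = \lambda \langle \phi(w'), M y\rangle_{\mathcal H} \cdot \big( \langle \nabla\phi(w), M y\rangle \cdot \langle \phi(w), M\phi(w')\rangle_{\mathcal H} + \langle \phi(w), M y\rangle \cdot \langle \nabla\phi(w), M\phi(w')\rangle_{\mathcal H} \big) \]
and, using again that $\|M\|_{\mathrm{op}} = \|(K_\eta + \lambda)^{-1}\|_{\mathrm{op}} \le \lambda^{-1}$,
\[ \|\nabla_w J''_\lambda[\eta](w, w')\|_w \le \lambda B_0 \lambda^{-1} \|y\|_{\mathcal H} \cdot 2 B_0^2 B_1 \lambda^{-2} \|y\|_{\mathcal H} = 2 \lambda^{-2} B_0^3 B_1 \|y\|_{\mathcal H}^2 . \]
Then applying Lem. D.8 shows the second inequality.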
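Furthermore, for a fixed $w \in \mathcal W$, continuing from the expression of $\nabla_w J''_\lambda[\eta](w, w')$ derived above,
\[ \nabla_{w'} \nabla_w J''_\lambda[\eta](w, w') = \lambda \langle \nabla\phi(w'), M y\rangle_{\mathcal H} \cdot \big( \langle \nabla\phi(w), M y\rangle \cdot \langle \phi(w), M\phi(w')\rangle_{\mathcal H} + \langle \phi(w), M y\rangle \cdot \langle \nabla\phi(w), M\phi(w')\rangle_{\mathcal H} \big) \]
\[ \quad + \lambda \langle \phi(w'), M y\rangle_{\mathcal H} \cdot \big( \langle \nabla\phi(w), M y\rangle \cdot \langle \phi(w), M\nabla\phi(w')\rangle_{\mathcal H} + \langle \phi(w), M y\rangle \cdot \langle \nabla\phi(w), M\nabla\phi(w')\rangle_{\mathcal H} \big), \]
so $\|\nabla_{w'} \nabla_w J''_\lambda[\eta](w, w')\| \le 4 \lambda^{-2} B_0^2 B_1^2 \|y\|_{\mathcal H}^2$, and the third inequality follows by applying Lem. D.8 to $\eta \mapsto \langle s, \nabla J'_\lambda[\eta](w)\rangle_w$ for $s \in T_w\mathcal W$ arbitrary.

Finally, by differentiating the expression of $\nabla_{w'} \nabla_w J''_\lambda[\eta](w, w')$ once more with respect to $w$ we get that, for any fixed $w \in \mathcal W$,
\[ \nabla_{w'} \nabla_w^2 J''_\lambda[\eta](w, w') = \lambda \langle \nabla\phi(w'), M y\rangle_{\mathcal H} \cdot \big( \langle \nabla^2\phi(w), M y\rangle \cdot \langle \phi(w), M\phi(w')\rangle_{\mathcal H} + \langle \nabla\phi(w), M y\rangle \cdot \langle \nabla\phi(w), M\phi(w')\rangle_{\mathcal H} \big) \]
\[ \quad + \lambda \langle \nabla\phi(w'), M y\rangle_{\mathcal H} \cdot \big( \langle \nabla\phi(w), M y\rangle \cdot \langle \nabla\phi(w), M\phi(w')\rangle_{\mathcal H} + \langle \phi(w), M y\rangle \cdot \langle \nabla^2\phi(w), M\phi(w')\rangle_{\mathcal H} \big) \]
\[ \quad + \lambda \langle \phi(w'), M y\rangle_{\mathcal H} \cdot \big( \langle \nabla^2\phi(w), M y\rangle \cdot \langle \phi(w), M\nabla\phi(w')\rangle_{\mathcal H} + \langle \nabla\phi(w), M y\rangle \cdot \langle \nabla\phi(w), M\nabla\phi(w')\rangle_{\mathcal H} \big) \]
\[ \quad + \lambda \langle \phi(w'), M y\rangle_{\mathcal H} \cdot \big( \langle \nabla\phi(w), M y\rangle \cdot \langle \nabla\phi(w), M\nabla\phi(w')\rangle_{\mathcal H} + \langle \phi(w), M y\rangle \cdot \langle \nabla^2\phi(w), M\nabla\phi(w')\rangle_{\mathcal H} \big), \]
hence $\|\nabla_{w'} \nabla_w^2 J''_\lambda[\eta](w, w')\| \le \lambda^{-2} \|y\|^2 B_0 B_1 (4 B_2 B_0 + 4 B_1^2)$, and the fourth inequality follows by applying Lem. D.8 to $\eta \mapsto \langle s, \nabla^2 J'_\lambda[\eta](w) \cdot s\rangle_w$ for $s \in T_w\mathcal W$ arbitrary.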
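The following lemma provides explicit upper estimates of the regularity constants $B_0, B_1, B_2$ of $\phi$ appearing in Prop. F.2, in terms of the activation function $\varphi$ and the data distribution $\rho$.

Lemma F.3. Under Assumption 2, recall that $\phi : \mathcal W \to \mathcal H = L^2_\rho(\mathbb R^{d+1})$ is defined by $\phi(w)(x) = \varphi(\langle w, x\rangle)$, and that $\varphi : \mathbb R \to \mathbb R$ is $C^2$. There exists a universal constant $c > 0$ such that
\[ \sup_{w \in S^d} \|\phi(w)\|_{\mathcal H} \le \|\varphi\|_{L^2(\rho)}, \qquad \sup_{w \in S^d} \|\nabla\phi(w)\|_{\mathcal H} \le \|\varphi'\|_{L^4(\rho)} N_4(\rho), \qquad \sup_{w \in S^d} \|\nabla^2\phi(w)\|_{\mathcal H} \le \big( \|\varphi''\|_{L^4(\rho)} + \|\varphi'\|_{L^4(\rho)} \big) N_4(\rho) \]
where
\[ N_4(\rho) := \sup_{\|u\|_2 \le 1} \big( \mathbb{E}_{x\sim\rho} \langle u, x\rangle^4 \big)^{1/4} \quad \text{and} \quad \forall f : \mathbb R \to \mathbb R, \ \|f\|_{L^p(\rho)} := \sup_{w \in S^d} \big( \mathbb{E}_{x\sim\rho} |f(\langle w, x\rangle)|^p \big)^{1/p}. \]
Note that if $\rho$ is rotationally invariant, then $\mathbb{E}_{x\sim\rho} |f(\langle w, x\rangle)|^p$ is independent of $w$, and there exists a universal constant $c$ such that $N_4(\rho) \le c\, d^{-1/2} \big( \mathbb{E}_{x\sim\rho} \|x\|^4 \big)^{1/4}$.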
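To illustrate the last claim of the lemma in the simplest rotationally invariant case, take $\rho = \tau$, the uniform measure on $S^d \subset \mathbb R^{d+1}$ (an assumption made only for this example). Then $\mathbb{E}\langle u, x\rangle^4 = 3/((d+1)(d+3))$ for any unit vector $u$, which is a standard moment identity for the sphere, so $N_4(\tau) \le 3^{1/4} d^{-1/2}$. The sketch below compares this value with a Monte Carlo estimate; the function names are hypothetical.

```python
import numpy as np

def sphere_fourth_moment_exact(d):
    # E <u, x>^4 for x uniform on S^d (embedded in R^{d+1}) and any unit vector u.
    return 3.0 / ((d + 1) * (d + 3))

def sphere_fourth_moment_mc(d, n_samples, seed=0):
    # Monte Carlo estimate with u = e_1; by rotational invariance u does not matter.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, d + 1))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    return float(np.mean(x[:, 0] ** 4))

d = 10
exact = sphere_fourth_moment_exact(d)
estimate = sphere_fourth_moment_mc(d, 200_000)
n4 = exact ** 0.25                     # N_4(tau)
print(exact, estimate, n4, 3 ** 0.25 / np.sqrt(d))
```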
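Proof. For the first inequality, we have by definition
\[ \sup_w \|\phi(w)\|_{\mathcal H} = \sup_w \big( \mathbb{E}_{x\sim\rho} |\varphi(\langle w, x\rangle)|^2 \big)^{1/2} = \|\varphi\|_{L^2(\rho)}. \]
For the second inequality, define the orthogonal projector $\Pi_w = I_{d+1} - w w^\top : \mathbb R^{d+1} \to T_w S^d = \{w\}^\perp$ for any $w \in S^d$. Then $[\nabla\phi(w)](x) = \varphi'(\langle w, x\rangle)\Pi_w x$, so by Cauchy-Schwarz inequalities,
\[ \|\nabla\phi(w)\|_{\mathcal H} = \sup_{\|f\|_{L^2(\rho)} \le 1} \ \sup_{s \in T_w S^d,\ \|s\|_w = 1} \mathbb{E}_{x\sim\rho}\big[ f(x) \langle s, \nabla\phi(w)(x)\rangle_w \big] = \sup_{s \in T_w S^d,\ \|s\|_w = 1} \big( \mathbb{E}_{x\sim\rho} \langle s, \nabla\phi(w)(x)\rangle_w^2 \big)^{1/2} \]
\[ = \sup_{s \in T_w S^d,\ \|s\|_w = 1} \big( \mathbb{E}_{x\sim\rho} |\varphi'(\langle w, x\rangle)|^2 \langle \Pi_w s, x\rangle^2 \big)^{1/2} \le \big( \mathbb{E}_{x\sim\rho} |\varphi'(\langle w, x\rangle)|^4 \big)^{1/4} \cdot \sup_{\|u\|_2 = 1} \big( \mathbb{E}_{x\sim\rho} \langle u, x\rangle^4 \big)^{1/4} \]
since $\|s\|_w = \|\Pi_w s\|_2$.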
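For the third inequality, the Riemannian Hessian of $\phi(w) = \varphi(\langle w, \cdot\rangle) : S^d \to \mathbb R$ is given by
\[ \big[\nabla^2\phi(w)\big](x) = \nabla_w^2 \varphi(\langle w, x\rangle) = \nabla_w^\top \big[ \varphi'(\langle w, x\rangle)\Pi_w x \big] = \Pi_w \big[ \varphi''(\langle w, x\rangle)\, x x^\top - \varphi'(\langle w, x\rangle) \langle w, x\rangle\, I_{d+1} \big] \Pi_w , \]
so similarly by Cauchy-Schwarz inequalities,
\[ \|\nabla^2\phi(w)\|_{\mathcal H} \le \sup_{s \in T_w S^d,\ \|s\|_w = 1} \big( \mathbb{E}_{x\sim\rho} |\varphi''(\langle w, x\rangle)|^2 \langle s, \Pi_w x\rangle^2 \big)^{1/2} + \big( \mathbb{E}_{x\sim\rho} |\varphi'(\langle w, x\rangle)|^2 \langle w, x\rangle^2 \big)^{1/2} \]
\[ \le \big( \mathbb{E}_{x\sim\rho} |\varphi''(\langle w, x\rangle)|^4 \big)^{1/4} \cdot \sup_{s \in T_w S^d,\ \|s\|_w = 1} \big( \mathbb{E}_{x\sim\rho} \langle \Pi_w s, x\rangle^4 \big)^{1/4} + \big( \mathbb{E}_{x\sim\rho} |\varphi'(\langle w, x\rangle)|^4 \big)^{1/4} \big( \mathbb{E}_{x\sim\rho} \langle w, x\rangle^4 \big)^{1/4}. \]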
Finally, suppose that $\rho$ is rotationally invariant, and let us show that $N_4(\rho) \le c\, d^{-1/2} \big( \mathbb{E}_{x\sim\rho} \|x\|^4 \big)^{1/4}$ for some universal constant $c$. Indeed, for $x \sim \rho$, we have that $\|x\|$ and $\bar x = x / \|x\|$ are independent and that $\bar x \sim \tau$. Therefore,
\[ N_4^4(\rho) = \sup_{\|u\|_2 \le 1} \mathbb{E}_{x\sim\rho}\big[ \|x\|^4 \langle u, x/\|x\|\rangle^4 \big] = \sup_{\|u\|_2 \le 1} \mathbb{E}_{x\sim\rho} \|x\|^4 \cdot \mathbb{E}_{\bar x\sim\tau} \langle u, \bar x\rangle^4 , \]
and
\[ \sup_{\|u\|_2 \le 1} \mathbb{E}_{\bar x\sim\tau} \langle u, \bar x\rangle^4 \le \frac{c}{(d+1)^2} \]
for some universal constant $c$, which is a direct consequence of the fact that $\langle u, \bar x\rangle$ is sub-Gaussian with sub-Gaussian norm $c / \sqrt{d+1}$ for some universal constant $c$ [Ver18, Theorem 3.4.6], along with the moment bound for sub-Gaussian random variables [Ver18, Proposition 2.5.2].

Finally, we check rigorously in the following proposition that Assumption 2, with proper additional regularity assumptions on $\varphi$ and $\rho$, is a special case of Assumption 1.
Proposition F.4. Consider W = S d and G : M(W) → R defined as in Assumption 2. Suppose furthermore that N 4 (ρ), ∥φ∥ L 2 (ρ) , ∥φ ′ ∥ L 4 (ρ) , ∥φ ′′ ∥ L 4 (ρ) < ∞, where N 4 (ρ) and ∥•∥ L p (ρ) are defined in Lem. F.3. Then, G and W satisfy Assumption 1.
Proof. The fact that S d satisfies α τ -LSI with α τ = d -1 is classical and can be found in [BGL14, Sec. 5.7].
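By definition, $G(\nu) = \frac12 \big\| \int_{\mathcal W} \phi(w)\, d\nu(w) - y \big\|_{\mathcal H}^2$, so $G$ is non-negative and admits second variations: for any $\nu \in \mathcal M(\mathcal W)$ and $w, w' \in S^d$,
\[ G'[\nu](w) = \Big\langle \phi(w), \int_{\mathcal W} \phi(w')\, d\nu(w') - y \Big\rangle_{\mathcal H}, \qquad G''[\nu](w, w') = \langle \phi(w), \phi(w')\rangle_{\mathcal H} \]
and
\[ \nabla_w G''[\nu](w, w') = \langle \nabla\phi(w), \phi(w')\rangle_{\mathcal H}, \qquad \nabla_w^2 G''[\nu](w, w') = \langle \nabla^2\phi(w), \phi(w')\rangle_{\mathcal H}, \qquad \nabla_w \nabla_{w'} G''[\nu](w, w') = \langle \nabla\phi(w), \nabla\phi(w')\rangle_{\mathcal H}. \]
Consequently, denoting $C_i = \sup_{w \in S^d} \|\nabla^i\phi\|_{\mathcal H}$ for $i \in \{0, 1, 2\}$, which are all finite by Lem. F.3,
\[ |G''[\nu](w, w')| \le C_0^2 =: L_0, \quad \|\nabla_w G''[\nu](w, w')\|_w \le C_0 C_1 =: L_1, \quad \|\nabla_w^2 G''[\nu](w, w')\|_w \le C_0 C_2 =: L_2, \quad \|\nabla_w \nabla_{w'} G''[\nu](w, w')\| \le C_1^2 =: L_2. \]
Now for each $i \in \{0, 1, 2\}$,
\[ \forall (\nu, w, w'), \ \|\nabla_w^i G''[\nu](w, w')\|_w \le L_i \implies \forall (\nu, \nu', w), \ \|\nabla^i G'[\nu] - \nabla^i G'[\nu']\|_w \le L_i \|\nu - \nu'\|_{TV}. \]
Indeed, the right-hand side can be shown by applying the mean-value theorem to $g(\theta) = \langle s, \nabla^i G'[\nu + \theta(\nu' - \nu)](w)\rangle_w$ over $\theta \in [0, 1]$ for each $s \in (T_w\mathcal W)^{\otimes i}$. Thus, to show the existence of $B_i < \infty$ such that
\[ \forall (\nu, w, w'), \ \|\nabla^i G'[\nu]\|_w \le L_i \|\nu\|_{TV} + B_i , \]
it suffices to check that there exists $\nu_0$ such that $\|\nu_0\|_{TV}$ and $\sup_w \|\nabla^i G'[\nu_0]\|_w < \infty$. Note that for any $\nu$ and $w$,
\[ \nabla^i G'[\nu](w) = \Big\langle \nabla^i\phi(w), \int_{\mathcal W} \phi(w')\, d\nu(w') - y \Big\rangle_{\mathcal H}, \quad \text{thus} \quad \nabla^i G'[0](w) = -\langle \nabla^i\phi(w), y\rangle_{\mathcal H} \quad \text{and} \quad \sup_w \|\nabla^i G'[0](w)\|_w \le C_i \|y\|_{\mathcal H} < \infty. \]
Hence the existence of the $B_i < \infty$ is verified. This finishes the verification of Assumption 1.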
F.2 Proof of Thm. 5.2
In the single-index setting of Assumption 3, it is intuitive that $\delta_v$ is a minimizer of $J_\lambda$, for any $\lambda \ge 0$, and that $\eta_{\lambda,\beta}$ and $\delta_v$ are close in certain regimes of $\beta$ and $\lambda$. For this reason, we will first investigate the properties of $J'_\lambda[\delta_v]$ as a proxy of $J'_\lambda[\eta_{\lambda,\beta}]$, to show that it is amenable to a refined analysis for proving LSI, in Sec. F.2.1. This step uses a Lyapunov approach inspired by [MS14; LE23]. We will then prove that these properties carry from $J'_\lambda[\delta_v]$ over to $J'_\lambda[\eta_{\lambda,\beta}]$, in Sec. F.2.2, thanks to a quantitative bound on $W_2(\eta_{\lambda,\beta}, \delta_v)$ proved in Sec. F.2.3.

Lemma F.5. Under Assumptions 2 and 3, we have
\[ \forall w \in S^d, \quad J'_\lambda[\delta_v](w) = -\frac{\lambda}{2} \big( \lambda + \|\phi(v)\|_{\mathcal H}^2 \big)^{-2} \langle \phi(v), \phi(w)\rangle_{\mathcal H}^2 = -\frac{\lambda}{2} \big( \lambda + \|\varphi\|_{L^2(\rho)}^2 \big)^{-2} \big| \mathbb{E}_{x\sim\rho}\, \varphi(\langle x, v\rangle)\varphi(\langle x, w\rangle) \big|^2 = -\lambda\, g(\langle v, w\rangle) \]
for some $g : [-1, +1] \to \mathbb R$.
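The first equality of the lemma only uses that $K_{\delta_v}$ is rank one, so it can be checked on finite data. The sketch below assumes an empirical $\rho$ on $n$ points and a $\tanh$ activation, both chosen only for illustration (the final form $-\lambda g(\langle v, w\rangle)$ additionally needs rotational invariance of $\rho$ and is not tested here); function names are hypothetical.

```python
import numpy as np

def first_variation_at_dirac(phi_v, phi_w, lam):
    # Generic formula of Prop. F.1 with eta = delta_v and y = phi(v):
    # J'[delta_v](w) = -(lam/2) <phi(w), (K + lam id)^{-1} phi(v)>_H^2,
    # with <a, b>_H = mean(a * b) and K h = phi(v) <phi(v), h>_H.
    n = phi_v.shape[0]
    K = np.outer(phi_v, phi_v) / n
    h_hat = np.linalg.solve(K + lam * np.eye(n), phi_v)
    return -0.5 * lam * float(np.mean(phi_w * h_hat)) ** 2

def first_variation_closed_form(phi_v, phi_w, lam):
    # Closed form of Lem. F.5: -(lam/2) (lam + ||phi(v)||^2)^{-2} <phi(v), phi(w)>^2.
    norm_sq = float(np.mean(phi_v ** 2))
    inner = float(np.mean(phi_v * phi_w))
    return -0.5 * lam * inner ** 2 / (lam + norm_sq) ** 2

rng = np.random.default_rng(2)
d, n, lam = 4, 50, 0.3
X = rng.standard_normal((n, d + 1))
v = rng.standard_normal(d + 1); v /= np.linalg.norm(v)
w = rng.standard_normal(d + 1); w /= np.linalg.norm(w)
phi_v, phi_w = np.tanh(X @ v), np.tanh(X @ w)

generic = first_variation_at_dirac(phi_v, phi_w, lam)
closed = first_variation_closed_form(phi_v, phi_w, lam)
print(generic, closed)                 # the two values should agree
```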
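Proof. By Prop. F.1, since $y = \phi(v)$,
\[ J'_\lambda[\delta_v] = -\frac{\lambda}{2} \big\langle \phi(w), (K_{\delta_v} + \lambda\,\mathrm{id})^{-1} \phi(v) \big\rangle_{\mathcal H}^2 . \]
Since $\phi(v)$ is an eigenvector of $K_{\delta_v} = \int_{\mathcal W} \phi(w')\phi(w')^*\, d\delta_v = \phi(v)\phi(v)^*$ with eigenvalue $\|\phi(v)\|_{\mathcal H}^2 = \mathbb{E}_{x\sim\rho}\, \varphi(\langle x, v\rangle)^2 = \|\varphi\|_{L^2(\rho)}^2$, it is also an eigenvector of $(K_{\delta_v} + \lambda\,\mathrm{id})^{-1}$ with eigenvalue $(\|\phi(v)\|_{\mathcal H}^2 + \lambda)^{-1}$, whence the expression of $J'_\lambda[\delta_v]$ follows. Moreover, by rotational invariance of $\rho$, $\mathbb{E}_{x\sim\rho}\, \varphi(\langle x, v\rangle)\varphi(\langle x, w\rangle)$ depends only on $\langle v, w\rangle$, for all $w \in S^d$. In other words, there exists $g$ such that $J'_\lambda[\delta_v] = -\lambda\, g(\langle v, \cdot\rangle)$.

In summary, a Lyapunov condition of the form (F.4), along with a control on the eigenspectrum of $\nabla^2 f(w)$, implies an LSI for $e^{-\beta f}\tau / Z$. We record this fact in the theorem below, working out the proper dependence on problem parameters for future use.

Theorem F.6. Let $v \in S^d$, $0 < \lambda \le 1$ and $f : S^d \to \mathbb R$ of the form $f(w) = -\lambda g(\langle w, v\rangle)$ for some increasing function $g : [-1, 1] \to \mathbb R$. Suppose there exist constants $D_0, D_1, D_2, D_3, D_4 > 0$, and $r \in (0, \pi/2)$ such that if $\beta \ge D_0 d \lambda^{-1}$ then
\[ \forall w \in S^d, \quad \tfrac12 \Delta f - \tfrac{\beta}{4} \|\nabla f\|^2 \le D_1 \lambda d \tag{$L_{S^d}$} \]
\[ \forall w \in S^d \setminus U, \quad \tfrac12 \Delta f - \tfrac{\beta}{4} \|\nabla f\|^2 \le -D_2 \beta \lambda^2 \tag{$L_U$} \]
\[ \forall w \in S^d, \quad \lambda_{\min}(\nabla^2 f(w)) \ge -D_3 \lambda \tag{$C_{S^d}$} \]
\[ \forall w \in U, \quad \lambda_{\min}(\nabla^2 f(w)) \ge D_4 \lambda \tag{$C_U$} \]
where $U = \{ w \in S^d ;\ \mathrm{dist}_{\mathcal W}(w, v) \le r \}$. Then (provided that $\beta \ge D_0 d \lambda^{-1}$) the probability measure $\nu = \exp(-\beta f)\tau / Z$ satisfies $\alpha$-LSI for a constant $\alpha$ dependent only on the $D_i$ and on $r$.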
Furthermore, if the condition on $\beta$ is replaced by $\beta \ge D'_0 d^4 \lambda^{-4}$ and if $(L_{S^d})$ is replaced by
\[ \forall w \in S^d, \quad \tfrac12 \Delta f - \tfrac{\beta}{4} \|\nabla f\|^2 \le D'_1 \lambda d \beta^{3/4}, \tag{$L'_{S^d}$} \]
then (provided that $\beta \ge D'_0 d^4 \lambda^{-4}$) $\nu$ satisfies $\alpha'$-LSI for a constant $\alpha'$ dependent only on $D'_0, D'_1, D_2, D_3, D_4$ and on $r$.

where $\mathrm{Ric}_g$ denotes the Ricci curvature of $S^d$. As a result, $\nu$ satisfies Poincaré inequality with constant
\[ \kappa \ge \frac{D_2 \beta^2 \lambda^2}{1 + \dfrac{D_1 \lambda \beta d + D_2 \beta^2 \lambda^2}{d - 1 + \beta \lambda D_4}} \ge C \beta \lambda, \tag{F.5} \]
for some constant $C$ depending only on the $D_i$, where we used that $\beta \ge D_0 d \lambda^{-1}$.

Moreover, by [LE23, Proposition 9.17], if $\nu \in \mathcal P(S^d)$ satisfies the Poincaré inequality with constant $\kappa$, and $\beta \nabla^2 f + \mathrm{Ric}_g \succeq -\beta K$ for some $K > 0$ on $S^d$, then for $\beta \ge 1$, $\nu$ satisfies the LSI with constant $\alpha = \frac{\kappa}{11 \beta K}$. By the assumptions of the theorem, this indeed holds with $K = D_3 \lambda$. Consequently, $\nu$ satisfies LSI with constant $\alpha = C / (11 D_3)$, which finishes the proof of the first part of the theorem.

The second part, with $(L'_{S^d})$ instead of $(L_{S^d})$, follows by a similar reasoning, except that "$D_1$" should be replaced by "$D'_1 \beta^{3/4}$" in the calculation of (F.5). This still leads to a bound of the form $\kappa \ge C' \beta \lambda$ provided that $\beta \ge D'_0 d^4 \lambda^{-4}$, and the rest of the proof follows without change.
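We now verify that $J'_\lambda[\delta_v]$ satisfies the conditions of Thm. F.6.

Proposition F.7. Under the assumptions of Thm. 5.2,
$f_0 := J'_\lambda[\delta_v]$
\[ \|\nabla f_0(w)\|^2 = \lambda^2 g'(\langle w, v\rangle)^2 (1 - \langle w, v\rangle^2) \]
and
\[ \Delta f_0(w) = \mathrm{Tr}\, \nabla^2 f_0(w) = -\lambda \big[ g''(\langle w, v\rangle)(1 - \langle w, v\rangle^2) - g'(\langle w, v\rangle) \langle w, v\rangle d \big]. \]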
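Let $U = \{ w \in S^d : \mathrm{dist}_{S^d}(w, v) \le r \}$ for some $r > 0$ to be chosen.

Let us verify $(L_{S^d})$. We have for all $w \in S^d$
\[ \tfrac12 \Delta f_0 - \tfrac{\beta}{4} \|\nabla f_0\|^2 = -\frac{\lambda}{4} \big[ 2 g''(\langle w, v\rangle) + \beta \lambda g'(\langle w, v\rangle)^2 \big] (1 - \langle w, v\rangle^2) + \frac{\lambda}{2} g'(\langle w, v\rangle) \langle w, v\rangle d. \tag{F.6} \]
The second term is bounded by $\frac{\lambda}{2} C_1 d$. We can ensure that the first term is non-positive by appropriately restricting $\beta$ as follows:
\[ \inf_{[-1,1]} \big[ 2 g'' + \beta \lambda (g')^2 \big] \ge 0 \impliedby 2 (\inf g'') + \beta \lambda (\inf g')^2 \ge 0 \impliedby -2 C_2 + \beta \lambda c_1^2 \ge 0 \iff \beta \ge \frac{2 C_2}{c_1^2} \lambda^{-1}. \]
Let us verify $(L_U)$. We can upper-bound the first term in (F.6) by a negative quantity by restricting $\beta$ further: by a similar calculation as just above,
\[ \beta \ge \frac{4 C_2}{c_1^2 \lambda} \implies \inf_{[-1,1]} \big[ 2 g'' + \tfrac{\beta}{2} \lambda (g')^2 \big] \ge 0 \implies 2 g'' + \beta \lambda (g')^2 \ge \tfrac{\beta}{2} \lambda (g')^2 \ \text{over } [-1, 1]. \]
Then for all $w \in S^d \setminus U$, we have $r \le \mathrm{dist}_{\mathcal W}(w, v) = \arccos(\langle w, v\rangle) \le \frac{\pi}{2} \sqrt{1 - \langle w, v\rangle^2}$, and so
\[ \tfrac12 \Delta f_0 - \tfrac{\beta}{4} \|\nabla f_0\|^2 \le -\frac{\lambda}{4} \cdot \frac12 \beta \lambda g'(\langle w, v\rangle)^2 (1 - \langle w, v\rangle^2) + \frac{\lambda}{2} g'(\langle w, v\rangle) \langle w, v\rangle d = \frac{\lambda}{4} g'(\langle w, v\rangle) \Big[ -\frac{\beta \lambda}{2} g'(\langle w, v\rangle)(1 - \langle w, v\rangle^2) + 2 \langle w, v\rangle d \Big] \]
\[ \le \frac{\lambda}{4} g'(\langle w, v\rangle) \Big[ -\frac{2 \beta \lambda c_1 r^2}{\pi^2} + 2 \langle w, v\rangle d \Big] \le -\frac{\lambda}{4} g'(\langle w, v\rangle) \cdot \frac{\beta \lambda c_1 r^2}{\pi^2} \le -\frac{c_1^2}{4 \pi^2} \beta \lambda^2 r^2 \]
provided that $\beta \ge \frac{2 \pi^2 d}{\lambda c_1 r^2}$.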
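To verify $(C_{S^d})$, simply note that, since $\|\Pi_w v v^\top \Pi_w\|_{\mathrm{op}} = \|\Pi_w v\|^2 = 1 - \langle w, v\rangle^2$,
\[ \forall w, \quad \|\nabla^2 f_0(w)\|_{\mathrm{op}} \le \lambda |g''(\langle w, v\rangle)| (1 - \langle w, v\rangle^2) + \lambda C_1 \le \lambda \sup_{s \in [-1,1]} |g''(s)| (1 - s^2) + \lambda C_1 \le (C_3 + C_1) \lambda, \]
and therefore,
\[ \inf_{w \in S^d} \lambda_{\min}(\nabla^2 f_0(w)) \ge -\sup_w \|\nabla^2 f_0(w)\|_{\mathrm{op}} \ge -(C_3 + C_1) \lambda. \]
To verify $(C_U)$, we lower-bound $\lambda_{\min}(\nabla^2 f_0(w))$ on $U$ by treating the $g''$ term and the $g'$ term of the Hessian separately, where the bound of the second term follows from $\langle w, v\rangle \ge 0$, which can be ensured by taking $r \le \frac{\pi}{2}$. Since $w \in U \iff \langle w, v\rangle \ge \cos(r) \ge 1 - r^2$, it follows that
\[ \lambda_{\min}(\nabla^2 f_0(w)) \ge -\lambda \sup_{\cos r \le s \le 1} |g''(s)| (1 - s^2) + \lambda c_1 \cos r \ge -\lambda C_3 \sup_{\cos r \le s \le 1} \sqrt{1 - s^2} + \lambda c_1 \cos r = \lambda (-C_3 \sin r + c_1 \cos r) \ge \lambda \frac{c_1}{2} \]
for a certain choice of $r$ small enough, dependent only on $c_1$ and $C_3$.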

Section: F.2.2 Lyapunov function analysis for bounding the LSI constant of η λ,β
To prove Thm. 5.2, it only remains to show that the conditions of Thm. F.6 are satisfied for $J'_\lambda[\eta_{\lambda,\beta}]$ instead of $J'_\lambda[\delta_v]$.
Lemma F.8. Under the setting of Assumptions 2 and 3, η λ,β is rotationally invariant except for the direction v, or formally Rv = v =⇒ R ♯ η λ,β = η λ,β for orthonormal matrices R, where R ♯ η denotes the pushforward measure. Moreover, there exists g η : [-1, 1] → R such that for all w ∈ S d , J ′ λ [η λ,β ](w) = -λg η (⟨w, v⟩).
Proof. The lemma follows directly from the fact that ρ is rotationally invariant and that y = ϕ(v).
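Lemma F.9. Under Assumption 2, suppose furthermore that $\sup_w \|\nabla^i\phi(w)\|_{\mathcal H} \le B_i < \infty$ for $i \in \{0, 1, 2\}$. Then we have, for any $\eta, \eta' \in \mathcal P(\mathcal W)$ and all $w \in S^d$,
\[ \Big| \tfrac12 \Delta J'_\lambda[\eta] - \tfrac{\beta}{4} \|\nabla J'_\lambda[\eta]\|^2 - \tfrac12 \Delta J'_\lambda[\eta'] + \tfrac{\beta}{4} \|\nabla J'_\lambda[\eta']\|^2 \Big| \le \Big( d\, \frac{2 B_0 B_1 (B_0 B_2 + B_1^2)}{\lambda^2} \|y\|_{\mathcal H}^2 + \beta\, \frac{2 B_0^3 B_1^3}{\lambda^3} \|y\|_{\mathcal H}^4 \Big) W_1(\eta, \eta') \]
and
\[ \big| \lambda_{\min}(\nabla^2 J'_\lambda[\eta]) - \lambda_{\min}(\nabla^2 J'_\lambda[\eta']) \big| \le \frac{4 B_0 B_1 (B_0 B_2 + B_1^2)}{\lambda^2} \|y\|_{\mathcal H}^2\, W_1(\eta, \eta'). \]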
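Proof. By Prop. F.2,
\[ \big\| \nabla^2 J'_\lambda[\eta](w) - \nabla^2 J'_\lambda[\eta'](w) \big\|_{\mathrm{op}} \le \frac{4 B_0 B_1 (B_0 B_2 + B_1^2)}{\lambda^2} \|y\|_{\mathcal H}^2\, W_1(\eta, \eta') \]
and
\[ \big| \lambda_{\min}(\nabla^2 J'_\lambda[\eta](w)) - \lambda_{\min}(\nabla^2 J'_\lambda[\eta'](w)) \big| \le \big\| \nabla^2 J'_\lambda[\eta](w) - \nabla^2 J'_\lambda[\eta'](w) \big\|_{\mathrm{op}} \]
by Weyl's inequality. This shows the second inequality of the lemma.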
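The eigenvalue perturbation step can be illustrated on random symmetric matrices standing in for the two Hessians (an assumption made only for this example). The sketch checks $|\lambda_{\min}(A) - \lambda_{\min}(B)| \le \|A - B\|_{\mathrm{op}}$, together with the trace bound $|\mathrm{Tr}(A) - \mathrm{Tr}(B)| \le d\, \|A - B\|_{\mathrm{op}}$ used for the Laplacian term.

```python
import numpy as np

def random_symmetric(d, rng):
    # A random symmetric d x d matrix, used here as a stand-in for a Hessian.
    A = rng.standard_normal((d, d))
    return (A + A.T) / 2.0

def lambda_min(A):
    # Smallest eigenvalue of a symmetric matrix (eigvalsh returns ascending order).
    return float(np.linalg.eigvalsh(A)[0])

rng = np.random.default_rng(3)
d = 8
A = random_symmetric(d, rng)
B = A + 0.1 * random_symmetric(d, rng)     # a small symmetric perturbation of A

op_norm = float(np.linalg.norm(A - B, 2))  # operator (spectral) norm of A - B
gap_min = abs(lambda_min(A) - lambda_min(B))
gap_trace = abs(np.trace(A) - np.trace(B))
print(gap_min <= op_norm, gap_trace <= d * op_norm)
```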
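For the first inequality, we have $\Delta J'_\lambda[\eta](w) = \mathrm{Tr}\, \nabla^2 J'_\lambda[\eta](w)$ and so
\[ \Big| \tfrac12 \Delta J'_\lambda[\eta] - \tfrac12 \Delta J'_\lambda[\eta'] \Big| \le \frac{d}{2} \big\| \nabla^2 J'_\lambda[\eta](w) - \nabla^2 J'_\lambda[\eta'](w) \big\|_{\mathrm{op}} \le \frac{d}{2} \cdot \frac{4 B_0 B_1 (B_0 B_2 + B_1^2)}{\lambda^2} \|y\|_{\mathcal H}^2\, W_1(\eta, \eta'). \]
Moreover, we showed in (F.2) resp. in Prop. F.2 that
\[ \|\nabla J'_\lambda[\eta]\| \le \frac{B_0 B_1}{\lambda} \|y\|_{\mathcal H}^2 \quad \text{and} \quad \|\nabla J'_\lambda[\eta] - \nabla J'_\lambda[\eta']\| \le \frac{4 B_0^2 B_1^2}{\lambda^2} \|y\|_{\mathcal H}^2\, W_1(\eta, \eta'), \]
so
\[ \Big| \tfrac{\beta}{4} \|\nabla J'_\lambda[\eta]\|^2 - \tfrac{\beta}{4} \|\nabla J'_\lambda[\eta']\|^2 \Big| \le \frac{\beta}{4} \cdot 2\, \frac{B_0 B_1 \|y\|_{\mathcal H}^2}{\lambda} \cdot \frac{4 B_0^2 B_1^2 \|y\|_{\mathcal H}^2}{\lambda^2}\, W_1(\eta, \eta') = \beta\, \frac{2 B_0^3 B_1^3 \|y\|_{\mathcal H}^4}{\lambda^3}\, W_1(\eta, \eta'), \]
which implies the first inequality of the lemma by triangle inequality.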
We can now proceed to the proof of Thm. 5.2, thanks to a bound on W 2 (η λ,β , δ v ) under Assumption 3 proved in the next section.
Proof of Thm. 5.2. For concision, in this proof, we will use the notations $O(\cdot), \Omega(\cdot), \Theta(\cdot), \lesssim$ to hide constants dependent only on $\|\varphi\|_{L^2(\rho)}$, $\|\varphi'\|_{L^4(\rho)}$, $\|\varphi''\|_{L^4(\rho)}$, $\mathbb{E}_{x\sim\rho} \|x\|^4 / d^2$, $c_1, C_1, C_2, C_3$ and $C_4$.

We established in Prop. F.7 that $f_0 := J'_\lambda[\delta_v]$ satisfies the conditions $(L_{S^d})$ $(L_U)$ $(C_{S^d})$ $(C_U)$ of Thm. F.6 with some constants $D_i, r = O(1)$ (in fact only dependent on $c_1, C_1, C_2, C_3$) provided that $\beta \ge D_0 d \lambda^{-1}$. Thus, the first part of the theorem concerning the LSI of $\delta_v \propto e^{-\beta J'_\lambda[\delta_v]}\tau$, follows from Thm. F.6. To prove the second part of the theorem, it suffices to show that $f_* := J'_\lambda[\eta_{\lambda,\beta}]$ satisfies the conditions $(L'_{S^d})$ $(L_U)$ $(C_{S^d})$ $(C_U)$ of Thm. F.6 with some constants $D'_0, D'_1, D_2, D_3, D_4, r = \Theta(1)$. By Lem. F.3, there exist constants $B_i = O(1)$ such that $\sup_w \|\nabla^i\phi(w)\|_{\mathcal H} \le B_i$, for $i \in \{0, 1, 2\}$.

Moreover, by Lem. F.12 below, provided that $\beta \ge \Omega(d\lambda)$, one has
\[ W_2(\eta_{\lambda,\beta}, \delta_v) \lesssim \sqrt{ \beta^{-1} d \lambda^{-1} \cdot \log(\beta d^{-1} \lambda^{-1}) } =: W. \]

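Now by the conditions $(L_{S^d})$ $(L_U)$ $(C_{S^d})$ $(C_U)$ for $f = f_0$ and $D_i = \Theta(1)$ (by Prop. F.7), from Lem. F.9 along with the triangle inequality we have
\[ \forall w \in S^d, \quad \tfrac12 \Delta f_* - \tfrac{\beta}{4} \|\nabla f_*\|^2 \lesssim \lambda d + (d \lambda^{-2} + \beta \lambda^{-3}) W \]
\[ \forall w \in S^d \setminus U, \quad \tfrac12 \Delta f_* - \tfrac{\beta}{4} \|\nabla f_*\|^2 \le -D_2 \beta \lambda^2 + E_2 \cdot (d \lambda^{-2} + \beta \lambda^{-3}) W \]
\[ \forall w \in S^d, \quad \lambda_{\min}(\nabla^2 f_*(w)) \gtrsim -\lambda - \lambda^{-2} W \]
\[ \forall w \in U, \quad \lambda_{\min}(\nabla^2 f_*(w)) \ge D_4 \lambda - E_4 \cdot \lambda^{-2} W \]
for some constants $E_2, E_4 = O(1)$. So,
• $(L'_{S^d})$ for $f_*$ can be ensured with $D'_1 = O(1)$ provided that $(d \lambda^{-2} + \beta \lambda^{-3}) W = (\beta^{-1} d \lambda + 1) \beta \lambda^{-3} W = O(\lambda d \beta^{3/4})$. Since we already assume that $\beta \ge \Omega(d\lambda)$, this is equivalent to $\beta \lambda^{-3} W = O(\lambda d \beta^{3/4})$, i.e., $\beta^{1/4} \lambda^{-4} d^{-1} W = O(1)$.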

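• $(L_U)$ can be ensured with constant $D_2/2$ in place of $D_2$ if $\beta$ is such that $E_2 (d \lambda^{-2} + \beta \lambda^{-3}) W \le \frac{D_2}{2} \beta \lambda^2$, i.e., $(\beta^{-1} d \lambda + 1) \lambda^{-5} W \le \frac{D_2}{2 E_2}$. Since we already assume that $\beta \ge \Omega(d\lambda)$, this is equivalent to $\lambda^{-5} W \le F_2$ for a certain $F_2 = \Theta(1)$.
• $(C_{S^d})$ can be ensured with $D_3 = O(1)$ provided that $\lambda^{-2} W = O(\lambda)$, i.e., $\lambda^{-3} W = O(1)$.
• $(C_U)$ can be ensured with constant $D_4/4$ in place of $D_4$ if $E_4 \lambda^{-2} W \le \frac{D_4}{2} \lambda$, i.e., $\lambda^{-3} W \le \frac{D_4}{2 E_4} =: F_4 = \Theta(1)$.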
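In summary, since we assume $\lambda \le 1$, we have $\lambda^{-3} \le \lambda^{-5}$ and $\lambda^{-4} d^{-1} \le \lambda^{-5}$. Hence we will choose $\beta$ such that $\beta^{1/4} d^{-1} \lambda^{-4} W = O(1)$ and $\lambda^{-5} W \le F_2$ for a certain $F_2 = \Theta(1)$, and this will ensure all four conditions with constants $D'_1, D_2, D_3, D_4 = \Theta(1)$. For choices of $\beta$ such that $\beta \ge d^4 \lambda^{-4}$, it suffices to have $\beta^{1/4} d^{-1} \lambda^{-4} W \le F_2$. Now substituting the definition of $W$, this sufficient condition rewrites
\[ \beta^{1/4} d^{-1} \lambda^{-4} W \le F_2 \iff \beta^{1/2} d^{-2} \lambda^{-8} \cdot \beta^{-1} d \lambda^{-1} \log\frac{\beta}{d\lambda} = \beta^{-1/2} \lambda^{-9} d^{-1} \log\frac{\beta}{d\lambda} \le F_2^2 . \]
Since $\forall \varepsilon, x > 0$, $\varepsilon \log x = \log x^\varepsilon \le x^\varepsilon$, then for any $\varepsilon > 0$ it suffices to choose $\beta$ such that
\[ \beta^{-1/2} \lambda^{-9} d^{-1} \Big( \frac{\beta}{d\lambda} \Big)^{\varepsilon} \le \varepsilon F_2^2 \iff \beta^{1/2 - \varepsilon} \ge \varepsilon^{-1} F_2^{-2} \lambda^{-9-\varepsilon} d^{-1-\varepsilon}. \]
Choosing e.g. $\varepsilon = \frac14$, we get that a sufficient condition is $\beta \ge \Omega(\mathrm{poly}(\lambda^{-1}, d))$. Hence we may apply the second part of Thm. F.6 to $f_* = J'_\lambda[\eta_{\lambda,\beta}]$ with constants $D'_1, D_2, D_3, D_4 = O(1)$, provided that $\beta \ge \Omega(\mathrm{poly}(\lambda^{-1}, d))$. This concludes the proof of the second part of the theorem.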
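The last step of the proof can be made concrete: with $\varepsilon = \frac14$, the sufficient condition reads $\beta \ge \big( 4 F_2^{-2} \lambda^{-9.25} d^{-1.25} \big)^4$. The sketch below evaluates this threshold and checks that it does imply $\beta^{-1/2} \lambda^{-9} d^{-1} \log\frac{\beta}{d\lambda} \le F_2^2$. The numerical value used for $F_2$ is a placeholder, since the text only states $F_2 = \Theta(1)$; function names are hypothetical.

```python
import math

def sufficient_beta(lam, d, F2, eps=0.25):
    # Smallest beta with beta^(1/2 - eps) >= eps^-1 F2^-2 lam^(-9 - eps) d^(-1 - eps).
    rhs = (1.0 / eps) * F2 ** -2 * lam ** (-9.0 - eps) * d ** (-1.0 - eps)
    return rhs ** (1.0 / (0.5 - eps))

def log_condition_lhs(beta, lam, d):
    # Left-hand side beta^(-1/2) lam^-9 d^-1 log(beta / (d lam)) of the target condition.
    return beta ** -0.5 * lam ** -9.0 * (1.0 / d) * math.log(beta / (d * lam))

lam, d, F2 = 0.5, 10, 1.0              # F2 = 1.0 is a hypothetical placeholder value
beta_min = sufficient_beta(lam, d, F2)
print(beta_min, log_condition_lhs(beta_min, lam, d), F2 ** 2)
```

The threshold is polynomial in $\lambda^{-1}$ and $d$, as claimed, but with large exponents, so the resulting values of $\beta$ are very large for small $\lambda$.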

Section: F.2.3 Bound on $W_1(\eta_{\lambda,\beta}, \delta_v)$
The following lemma shows a form of weak coercivity of $J_\lambda$.

Lemma F.10. Under Assumptions 2 and 3, if furthermore there exist $c_1, C_1, C_3, C_4 > 0$ such that
\[ \forall r \in [-1, +1], \quad c_1 \le g'(r) \le C_1, \quad |g''(r)| (1 - r^2)^{1/2} \le C_3, \quad |g'''(r)| (1 - r^2)^{3/2} \le C_4, \]
then there exists a constant $\alpha_g$ dependent only on $c_1, C_1, C_3, C_4$ such that
\[ \forall \eta, \quad J_\lambda(\eta) - J_\lambda(\delta_v) \ge \lambda \alpha_g W_2^2(\eta, \delta_v). \]
Proof. Since $J_\lambda$ is convex,
\[ J_\lambda(\eta) - J_\lambda(\delta_v) \ge \int_{S^d} J'_\lambda[\delta_v]\, d(\eta - \delta_v) = -\lambda \int_{S^d} g(\langle v, w\rangle)\, d(\eta - \delta_v)(w) = \lambda \int_{S^d} \big[ g(1) - g(\langle v, w\rangle) \big]\, d\eta(w). \]
Now let $U_r = \{ w \in S^d ;\ \mathrm{dist}_{S^d}(w, v) \le r \}$ for some $r > 0$ to be chosen. We will compute the integral separately on $U_r$ and on $S^d \setminus U_r$.
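For the part $\int_{U_r}$, we proceed by a second-order Taylor expansion. Namely, for any $w \in U_r \setminus \{v\}$, let $e \perp v$ such that $w = \cos(\theta) v + \sin(\theta) e$ for some $0 < \theta \le r$, since $\mathrm{dist}_{S^d}(w, v) = \arccos(\langle w, v\rangle) = \theta$. Then $g(\langle v, w\rangle) = g(\cos\theta)$, and
\[ \frac{d}{d\theta} g(\cos\theta) = -\sin(\theta) g'(\cos\theta), \qquad \frac{d^2}{d\theta^2} g(\cos\theta) = \sin(\theta)^2 g''(\cos\theta) - \cos(\theta) g'(\cos\theta), \]
\[ \frac{d^3}{d\theta^3} g(\cos\theta) = -\sin(\theta)^3 g'''(\cos\theta) + 3 \sin(\theta)\cos(\theta) g''(\cos\theta) + \sin(\theta) g'(\cos\theta). \]
Notice that by our assumptions on $g$, it is smooth enough at 1 so that $\sin(\theta) g'(\cos\theta) \to 0$ and $\sin(\theta)^2 g''(\cos\theta) \to 0$ as $\theta \to 0$. Further,
\[ \sup_\theta \Big| \frac{d^3}{d\theta^3} g(\cos\theta) \Big| \le C_4 + 3 C_3 + C_1 =: 6 M_{3,g}. \]
Consequently, by a univariate Taylor expansion with remainder in Lagrange form around $\theta = 0$, for all $0 < \theta \le r$, provided that we choose $r \le \frac{g'(1)}{2 M_{3,g}}$, we have
\[ g(\cos\theta) = g(1) + 0 + \frac12 (0 - g'(1)) \theta^2 + \frac16 (g \circ \cos)^{(3)}(u)\, \theta^3 \quad \text{for some } u \in [0, r] \]
\[ \le g(1) - \frac12 g'(1) \theta^2 + \frac16 \sup_{[0,r]} \big| (g \circ \cos)^{(3)} \big|\, \theta^3 \le g(1) - \frac12 g'(1) \theta^2 + M_{3,g} \theta^3 = g(1) - \Big( \frac12 g'(1) - M_{3,g} \theta \Big) \theta^2 \le g(1) - \frac14 g'(1) \theta^2 . \tag{F.7} \]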
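In other words,
\[ \forall w \in U_r, \quad g(1) - g(\langle v, w\rangle) \ge \frac14 g'(1)\, \mathrm{dist}_{S^d}(w, v)^2 , \]
and so,
\[ \int_{U_r} \big[ g(1) - g(\langle v, w\rangle) \big]\, d\eta(w) \ge \frac14 g'(1) \int_{U_r} \mathrm{dist}_{S^d}(w, v)^2\, d\eta(w). \]
For the part $\int_{S^d \setminus U_r}$, since $g$ is increasing on $[-1, 1]$ because $g' \ge c_1 > 0$, we have
\[ \int_{S^d \setminus U_r} \big[ g(1) - g(\langle v, w\rangle) \big]\, d\eta(w) \ge \big[ g(1) - g(\cos(r)) \big] \big[ 1 - \eta(U_r) \big] \ge \frac14 g'(1) r^2 \big[ 1 - \eta(U_r) \big] \]
where the second inequality follows from the Taylor expansion (F.7) above applied to $\theta = r$.

Thus we showed
\[ J_\lambda(\eta) - J_\lambda(\delta_v) \ge \lambda \Big( \frac14 g'(1) r^2 \big[ 1 - \eta(U_r) \big] + \frac{g'(1)}{4} \int_{U_r} \mathrm{dist}_{S^d}(w, v)^2\, d\eta(w) \Big) = \frac{\lambda g'(1)}{4} \Big( r^2 \big[ 1 - \eta(U_r) \big] + \int_{U_r} \mathrm{dist}_{S^d}(w, v)^2\, d\eta(w) \Big). \]
On the other hand, since $\mathrm{dist}_{S^d}(v, w) = \arccos(\langle v, w\rangle)$,
\[ W_2^2(\eta, \delta_v) = \int_{S^d \setminus U_r} \mathrm{dist}_{S^d}(v, w)^2\, d\eta(w) + \int_{U_r} \mathrm{dist}_{S^d}(v, w)^2\, d\eta(w) \le \pi^2 \big[ 1 - \eta(U_r) \big] + \int_{U_r} \mathrm{dist}_{S^d}(v, w)^2\, d\eta(w). \]
Hence
\[ J_\lambda(\eta) - J_\lambda(\delta_v) \ge \frac{\lambda g'(1)}{4} \cdot \sup_{0 \le r \le \frac{g'(1)}{2 M_{3,g}}} \min\Big\{ \frac{r^2}{\pi^2}, 1 \Big\}\, W_2^2(\eta, \delta_v) = \lambda \cdot \frac{g'(1)}{4} \min\Big\{ \Big( \frac{g'(1)}{2 M_{3,g}} \Big)^2 / \pi^2, 1 \Big\} \cdot W_2^2(\eta, \delta_v) \]
\[ \ge \lambda \cdot \frac{c_1}{4} \min\Big\{ \Big( \frac{c_1}{2 M_{3,g}} \Big)^2 / \pi^2, 1 \Big\} \cdot W_2^2(\eta, \delta_v) =: \lambda \alpha_g W_2^2(\eta, \delta_v). \]
Notice that $\alpha_g$ only depends on $c_1, C_1, C_3, C_4$.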
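We will use the following fact about the surface area of a small hyperspherical cap around a pole for bounding $W_1(\eta_{\lambda,\beta}, \delta_v)$. It essentially shows that, for $\mathcal W = S^d$, the constant called $C$ in the statement of Lem. E.1 scales with dimension as $2^{-d} \lesssim C^{-1} \lesssim 1/\sqrt{d}$.

Lemma F.11. Fix $d \ge 2$ and $v \in S^d$ and denote by $\tau$ the uniform measure on $S^d$. For any $\epsilon > 0$, let $S_\epsilon = \{ w \in S^d : \mathrm{dist}_{S^d}(w, v) \le \epsilon \}$. There exist universal constants $C_-, C_+ > 0$ such that
\[ \forall\, 0 < \epsilon \le \frac{\pi}{4}, \quad C_-^{-1} (\epsilon/2)^d \le \tau(S_\epsilon) \le C_+ \epsilon^d / \sqrt{d}. \]
Proof. For $w \sim \tau$, the distribution of $\langle w, v\rangle$ admits a probability density function $h(z) = (1 - z^2)^{d/2 - 1} / Z$, where
\[ Z = \int_{-1}^{1} (1 - z^2)^{d/2 - 1}\, dz = B\Big( \frac{d}{2}, \frac12 \Big) = \frac{\Gamma\big(\frac{d}{2}\big) \sqrt{\pi}}{\Gamma\big(\frac{d+1}{2}\big)}. \]
Note that by Gautschi's inequality $\forall s \in (0, 1), \forall x > 0$, $x^{1-s} < \frac{\Gamma(x+1)}{\Gamma(x+s)} < (x+1)^{1-s}$ applied to $s = \frac12$ and $x = \frac{d-1}{2}$, we have $\sqrt{\frac{d-1}{2}} < \frac{\Gamma(\frac{d+1}{2})}{\Gamma(\frac{d}{2})} < \sqrt{\frac{d+1}{2}}$, so
\[ \sqrt{\frac{2\pi}{d+1}} \le Z \le \sqrt{\frac{2\pi}{d-1}}. \]
By definition, since $\mathrm{dist}_{S^d}(w, v) = \arccos(\langle w, v\rangle)$, $\tau(S_\epsilon) = \int_{\cos(\epsilon)}^{1} h(z)\, dz$. One can verify
\[ \forall\, 0 < \epsilon \le \frac{\pi}{4}, \quad \sqrt{1 - \epsilon^2} \le \cos(\epsilon) \le \sqrt{1 - \frac{\epsilon^2}{4}}. \]
So for all $0 < \epsilon \le \frac{\pi}{4}$,
\[ \tau(S_\epsilon) = \int_{\cos(\epsilon)}^{1} h(z)\, dz \le \int_{\sqrt{1-\epsilon^2}}^{1} h(z)\, dz = Z^{-1} \int_{\sqrt{1-\epsilon^2}}^{1} (1 - z^2)^{d/2-1}\, dz = Z^{-1} \int_{1-\epsilon^2}^{1} (1 - t)^{d/2-1} \frac{dt}{2\sqrt{t}} \]
\[ \le Z^{-1} \frac{1}{2\sqrt{1-\epsilon^2}} \int_{1-\epsilon^2}^{1} (1 - t)^{d/2-1}\, dt = Z^{-1} \frac{1}{2\sqrt{1-\epsilon^2}} \int_{0}^{\epsilon^2} t^{d/2-1}\, dt = Z^{-1} \frac{1}{2\sqrt{1-\epsilon^2}} \cdot \frac{2}{d} [\epsilon^2]^{d/2} \le Z^{-1} \frac{1}{d \sqrt{1 - (\pi/4)^2}}\, \epsilon^d \le C_+ \epsilon^d / \sqrt{d} \]
for some universal constant $C_+$. In the other direction,
\[ \tau(S_\epsilon) \ge \int_{\sqrt{1-\epsilon^2/4}}^{1} h(z)\, dz = Z^{-1} \int_{\sqrt{1-\epsilon^2/4}}^{1} (1 - z^2)^{d/2-1}\, dz = Z^{-1} \int_{1-\epsilon^2/4}^{1} (1 - t)^{d/2-1} \frac{dt}{2\sqrt{t}} \]
\[ \ge Z^{-1} \frac12 \int_{1-\epsilon^2/4}^{1} (1 - t)^{d/2-1}\, dt = Z^{-1} \frac12 \int_{0}^{\epsilon^2/4} t^{d/2-1}\, dt = Z^{-1} \frac12 \cdot \frac{2}{d} [\epsilon^2/4]^{d/2} = Z^{-1} \frac{1}{d} (\epsilon/2)^d \ge c\, (\epsilon/2)^d / \sqrt{d} \]
for some universal constant $c$. By repeating the same argument with $1 - \frac{\epsilon^2}{4}$ replaced by $1 - \frac{\epsilon^2}{3.9}$, we get $\tau(S_\epsilon) \ge c' (\epsilon/1.99)^d / \sqrt{d} \ge C_-^{-1} (\epsilon/2)^d$ for some universal constants $c', C_-$.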
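The intermediate bounds of this proof are explicit and can be checked numerically. The sketch below computes $Z$ through the log-Gamma function, evaluates $\tau(S_\epsilon) = Z^{-1} \int_0^\epsilon \sin^{d-1}\theta\, d\theta$ (the substitution $z = \cos\theta$ in the density $h$) with a midpoint rule, and compares it with the bounds $\frac{1}{Zd}(\epsilon/2)^d$ and $\frac{\epsilon^d}{Z d \sqrt{1-\epsilon^2}}$ obtained in the two displays above. The quadrature scheme and the function names are choices made for this illustration.

```python
import math
import numpy as np

def normalizer_Z(d):
    # Z = B(d/2, 1/2) = Gamma(d/2) sqrt(pi) / Gamma((d+1)/2), via log-Gamma.
    return math.sqrt(math.pi) * math.exp(math.lgamma(d / 2) - math.lgamma((d + 1) / 2))

def cap_measure(d, eps, n_grid=20_000):
    # tau(S_eps) = Z^{-1} * integral_0^eps sin(theta)^(d-1) dtheta, midpoint rule.
    h = eps / n_grid
    theta = (np.arange(n_grid) + 0.5) * h
    return float(np.sum(np.sin(theta) ** (d - 1)) * h) / normalizer_Z(d)

d, eps = 10, 0.5
Z = normalizer_Z(d)
tau_cap = cap_measure(d, eps)
lower = (eps / 2) ** d / (Z * d)
upper = eps ** d / (Z * d * math.sqrt(1 - eps ** 2))
print(math.sqrt(2 * math.pi / (d + 1)), Z, math.sqrt(2 * math.pi / (d - 1)))
print(lower, tau_cap, upper)
```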
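The following lemma combines the weak coercivity and weak Lipschitz-continuity of $J_\lambda$ by a $\Gamma$-convergence type argument, to show an explicit bound on $W_1(\eta_{\lambda,\beta}, \delta_v)$. It quantifies the intuitive fact that $\eta_{\lambda,\beta}$ converges weakly to $\delta_v$ when $\beta^{-1} \to 0$ or $\lambda \to +\infty$.

Lemma F.12. Under Assumptions 2 and 3, if $\sup_w \|\nabla^i\phi(w)\|_{\mathcal H} \le B_i < \infty$ for $i \in \{0, 1\}$, and if $\beta \ge \frac{4 d \lambda}{\pi} \big( B_0 B_1 \|y\|_{\mathcal H}^2 \big)^{-1}$, then
\[ W_2(\eta_{\lambda,\beta}, \delta_v) \le \sqrt{ \frac{1}{\alpha_g} \frac{\beta^{-1} d}{\lambda} \Big( C + \log\big( B_0 B_1 \|y\|_{\mathcal H}^2 \big) - \log(\beta^{-1} d \lambda) \Big) } \]
where $C$ is a universal constant and $\alpha_g$ is the constant from Lem. F.10.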
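Proof. Since $\eta_{\lambda,\beta} = \arg\min J_{\lambda,\beta}$ and $J_{\lambda,\beta} = J_\lambda + \beta^{-1} H(\cdot|\tau)$, then for any $\eta_\sigma \in \mathcal P(\mathcal W)$,
\[ J_\lambda(\eta_{\lambda,\beta}) \le J_\lambda(\eta_{\lambda,\beta}) + \beta^{-1} H(\eta_{\lambda,\beta}|\tau) = J_{\lambda,\beta}(\eta_{\lambda,\beta}) \le J_{\lambda,\beta}(\eta_\sigma) = J_\lambda(\eta_\sigma) + \beta^{-1} H(\eta_\sigma|\tau). \]
Further, we showed in Lem. F.10 that $\forall \eta$, $J_\lambda(\eta) - J_\lambda(\delta_v) \ge \lambda \alpha_g \cdot W_2^2(\eta, \delta_v)$, so
\[ \lambda \alpha_g \cdot W_2^2(\eta_{\lambda,\beta}, \delta_v) \le J_\lambda(\eta_{\lambda,\beta}) - J_\lambda(\delta_v) \le J_\lambda(\eta_\sigma) - J_\lambda(\delta_v) + \beta^{-1} H(\eta_\sigma|\tau). \]
It remains to upper-bound the right-hand side, which we do by choosing as $\eta_\sigma$ a box-kernel smoothed version of $\delta_v$ (this part of the proof is essentially an instantiation of Lem. E.1). Specifically, let $\eta_\sigma$ be the uniform measure over the spherical cap $S_\sigma = \{ w \in S^d ;\ \mathrm{dist}_{S^d}(w, v) \le \sigma \}$ for $\sigma$ to be chosen. We showed in Prop. F.2 that
\[ J_\lambda(\eta_\sigma) - J_\lambda(\delta_v) \le \frac{B_0 B_1 \|y\|_{\mathcal H}^2}{\lambda} \cdot W_1(\eta_\sigma, \delta_v) \]
where $\sup_w \|\nabla^i\phi(w)\|_{\mathcal H} \le B_i$, and by definition
\[ W_1(\eta_\sigma, \delta_v) = \int \mathrm{dist}_{S^d}(w, v)\, d\eta_\sigma(w) = \frac{1}{\mathrm{vol}(S_\sigma)} \int_{S_\sigma} \mathrm{dist}_{S^d}(w, v)\, d\mathrm{vol}(w) \le \sigma. \]
Moreover by Lem. F.11, provided that $0 < \sigma \le \frac{\pi}{4}$,
\[ H(\eta_\sigma|\tau) = \int d\eta_\sigma \log\frac{d\eta_\sigma}{d\tau} = \log\frac{\mathrm{vol}(S^d)}{\mathrm{vol}(S_\sigma)} = -\log \tau(S_\sigma) \le \log C - d \log\frac{\sigma}{2} \]
for some universal constant $C$, and let us assume w.l.o.g. that $C > 1$, so that $\log C \le d \log C$. Thus
\[ J_\lambda(\eta_\sigma) - J_\lambda(\delta_v) + \beta^{-1} H(\eta_\sigma|\tau) \le \frac{B_0 B_1 \|y\|_{\mathcal H}^2}{\lambda}\, \sigma - \beta^{-1} d \log\sigma + \beta^{-1} d \log 2C. \]
Therefore, taking the infimum over $0 < \sigma \le \frac{\pi}{4}$,
\[ \lambda \alpha_g \cdot W_2^2(\eta_{\lambda,\beta}, \delta_v) \le \inf_{0 < \sigma \le \frac{\pi}{4}} \Big[ \frac{B_0 B_1 \|y\|_{\mathcal H}^2}{\lambda}\, \sigma - \beta^{-1} d \log\sigma \Big] + \beta^{-1} d \log 2C \]
\[ = \beta^{-1} d - \beta^{-1} d \log\frac{\beta^{-1} d \lambda}{B_0 B_1 \|y\|_{\mathcal H}^2} + \beta^{-1} d \log 2C = \beta^{-1} d \Big( 1 + \log(2C) - \log(\beta^{-1} d \lambda) + \log\big( B_0 B_1 \|y\|_{\mathcal H}^2 \big) \Big), \]
where on the second line we used that the unconstrained infimum of the right-hand side over $\sigma > 0$ is attained at $\sigma = \frac{\beta^{-1} d \lambda}{B_0 B_1 \|y\|_{\mathcal H}^2}$, which is indeed less than $\frac{\pi}{4}$ by assumption. This shows the bound
\[ W_2(\eta_{\lambda,\beta}, \delta_v) \le \sqrt{ \frac{1}{\lambda \alpha_g}\, \beta^{-1} d \Big( 1 + \log(2C) - \log(\beta^{-1} d \lambda) + \log\big( B_0 B_1 \|y\|_{\mathcal H}^2 \big) \Big) }, \]
and the bound announced in the lemma follows by gathering some universal constants into $C$.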
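The optimization over $\sigma$ in this proof is a one-dimensional problem of the form $\min_{\sigma > 0} a\sigma - b\log\sigma$, minimized at $\sigma = b/a$. The sketch below evaluates the smoothed upper bound, its minimizer, and the resulting closed form. The symbol `A` stands for $B_0 B_1 \|y\|_{\mathcal H}^2$, and the numerical values of `A` and `C` are placeholders, not values from the paper.

```python
import math

def smoothed_upper_bound(sigma, beta, d, lam, A, C):
    # (A / lam) * sigma - (d / beta) * log(sigma) + (d / beta) * log(2 C)
    return A * sigma / lam - d / beta * math.log(sigma) + d / beta * math.log(2 * C)

def optimal_sigma(beta, d, lam, A):
    # Minimizer of a * sigma - b * log(sigma) with a = A / lam and b = d / beta.
    return d * lam / (beta * A)

def optimal_value(beta, d, lam, A, C):
    # Closed form (d / beta) * (1 + log(2 C) - log(d lam / beta) + log(A)).
    return d / beta * (1 + math.log(2 * C) - math.log(d * lam / beta) + math.log(A))

beta, d, lam, A, C = 1e4, 10, 0.1, 1.0, 2.0   # hypothetical values for illustration
sigma_star = optimal_sigma(beta, d, lam, A)
value_star = smoothed_upper_bound(sigma_star, beta, d, lam, A, C)
print(sigma_star, value_star, optimal_value(beta, d, lam, A, C))
```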
\[ P_{k,d}(t) = \frac{(-1)^k\, \Gamma(d/2)}{2^k\, \Gamma(k + d/2)} (1 - t^2)^{(2-d)/2} \Big( \frac{d}{dt} \Big)^k (1 - t^2)^{k + (d-2)/2}. \]
We now go over some useful properties of spherical harmonics and Legendre polynomials.
• (Addition Formula) We have the following formula which relates Legendre polynomials to spherical harmonics [AH12, Theorem 2.9],
\[ \sum_{j=1}^{N(d,k)} Y_{kj}(w) Y_{kj}(v) = N(d, k)\, P_{k,d}(\langle w, v\rangle), \quad \forall w, v \in S^d. \]
• (Hecke-Funk Formula) Suppose $\phi \in L^2(\tau)$ is given by $\phi(\cdot) = \varphi(\langle w, \cdot\rangle)$ for some $w \in S^d$. Then [AH12, Theorem 2.22],
\[ \langle \phi, Y_{kj}\rangle_{L^2(\tau)} = \frac{\Gamma((d+1)/2)}{\Gamma(d/2)\sqrt{\pi}}\, Y_{kj}(w) \int_{-1}^{1} \varphi(t) P_{k,d}(t) (1 - t^2)^{(d-2)/2}\, dt. \]
• (Orthogonality of Legendre Polynomials) Using the addition formula and orthonormality of spherical harmonics, for every $k, k' \ge 0$ we have,
\[ \big\langle P_{k,d}(\langle w, \cdot\rangle), P_{k',d}(\langle v, \cdot\rangle) \big\rangle_{L^2(\tau)} = \delta_{kk'} \frac{P_{k,d}(\langle w, v\rangle)}{N(d, k)}. \]
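• (Derivative of Legendre Polynomials) For every $k \ge j$, we have the following identity for derivatives of Legendre polynomials [AH12, Equation (2.89)],
\[ P^{(j)}_{k,d}(t) = c_{j,k,d}\, P_{k-j,d+2j}(t), \]
where P

We use the tools introduced above to prove the following lemma.

Lemma F.13. Suppose $\rho$ is a spherically symmetric probability measure on $\mathbb R^{d+1}$. Define $q : [-1, 1] \to \mathbb R$ via $q(\langle w, v\rangle) = \int \varphi(\langle w, x\rangle)\varphi(\langle v, x\rangle)\, d\rho(x)$ for $w, v \in S^d$. Then, for every $j \ge 1$,
\[ q^{(j)}(\langle w, v\rangle) = \frac{1}{(d+1)(d+3)\cdots(d+2j-1)} \int \|x\|^{2j} \varphi^{(j)}(\langle w, x\rangle)\varphi^{(j)}(\langle v, x\rangle)\, d\rho(x), \]
where $\varphi^{(j)}$ denotes the $j$th derivative of $\varphi$.

Proof. We begin by introducing the notation $\varphi_r(\langle w, x\rangle) = \varphi(r\langle w, x\rangle)$. Doing so allows us to only consider functions on $S^d$ by conditioning on the norm of the input $\|x\|$. Notice that
\[ q(\langle w, v\rangle) = \mathbb{E}\Big[ \mathbb{E}\Big[ \varphi_{\|x\|}\Big( \Big\langle w, \frac{x}{\|x\|} \Big\rangle \Big) \varphi_{\|x\|}\Big( \Big\langle v, \frac{x}{\|x\|} \Big\rangle \Big) \,\Big|\, \|x\| \Big] \Big] = \mathbb{E}\big[ q_{\|x\|}(\langle w, v\rangle) \big], \tag{F.9} \]
where
\[ q_r(\langle w, v\rangle) := \int \varphi(r\langle w, x\rangle)\varphi(r\langle v, x\rangle)\, d\tau(x) = \big\langle \varphi_r(\langle w, \cdot\rangle), \varphi_r(\langle v, \cdot\rangle) \big\rangle_{L^2(\tau)}. \]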
By the Hecke-Funk formula, where the last identity follows from (F.11). We can now use the fact that φ (j) r = rφ (j) , and plug the above back into (F.9) to obtain q (j) (⟨w, v⟩) = E ∥x∥ q , we have q(z) ≥ 1 2 (inf |z|≤m φ(z)). Furthermore, by the Cauchy-Schwartz inequality, q(⟨w, v⟩) ≤ E[φ(⟨w, x⟩) 2 ] = ∥φ∥ 2 L 2 (ρ) . Next, we move on to bounding q ′ . Let x ∼ τ be a uniform random vector on S d . Then, for any r > 0, by Lem. F.13, , we obtain q ′ ≥ b1 2 (inf |z|≤m ϕ ′ (z)) 2 . Moreover, by the Cauchy-Schwartz inequality, q ′ ≤ b 2 ∥φ ′ ∥ 2 L 4 (ρ) . As a result, b 1 (inf |z|≤m φ(z)) 2 (inf |z|≤m φ ′ (z)) 2 , which completes the proof.
q ′ (⟨w, v⟩) = 1 d + 1 E ∥x∥ 2 φ ′ (⟨w, x⟩)φ ′ (⟨v, x⟩) = 1 d + 1 E ∥x∥ 2 E [φ ′ (
(λ + ∥φ∥ 2 L 2 (ρ) ) 2 ≤ g ′ ≤ b 2 ∥φ∥ 2 L 2 (ρ) ∥φ ′ ∥ 2 L 4 (ρ) (λ + ∥φ∥ 2 L 2 (ρ)

Section: New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: The paper does not release new assets.

Section: Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.

Section: Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.

Section: F.4 Implementation details for Fig. 1
We consider the problem (1.1) where $\mathcal{W} = \mathbb{S}^d$ and $G$ is defined as in Assumption 2, where $d = 10$, $\lambda = 10^{-3}$ and
• $y : \mathbb{R}^{d+1} \to \mathbb{R}$ is given by a teacher 2NN with 5 neurons defined as follows. The first-layer weights are orthonormal, drawn from the Haar measure, and the second-layer weights are drawn i.i.d. from $\mathcal{N}(0, 1.8 I_d)$. Its activation is $\varphi_{\mathrm{teacher}}(z) = \frac{z^4 - 6z^2 + 3}{\sqrt{24}}$, which is the normalized 4th degree Hermite polynomial.
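• $\rho$ is the empirical distribution of a (covariate) dataset $(x_i)_{i \le n}$ of $n = 100$ training samples, sampled i.i.d. from $\mathcal{N}\left(\begin{pmatrix} 0_d \\ 1 \end{pmatrix}, \begin{pmatrix} I_d & 0 \\ 0 & 0 \end{pmatrix}\right)$, with the last coordinate representing bias.
• The activation function $\varphi$ of the student 2NN $\hat{y}_\nu$ is the ReLU, $\varphi(z) = \max(0, z)$.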
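As a rough illustration of the setup in the bullets above, the following minimal Python sketch draws a teacher network and a covariate dataset. It is not the authors' code. The function names, the seed, the use of Gram-Schmidt on Gaussian vectors to obtain orthonormal first-layer weights, the choice of $\mathbb{R}^{d+1}$ as the space of those weights, the reading of the second-layer weights as scalars with variance 1.8, and the plain sum over neurons are all assumptions made for the sake of the example.

```python
import math
import random


def phi_teacher(z):
    """Normalized 4th-degree (probabilists') Hermite polynomial He_4(z) / sqrt(4!)."""
    return (z ** 4 - 6.0 * z ** 2 + 3.0) / math.sqrt(24.0)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sample_covariates(n, d, rng):
    """n points in R^{d+1}: first d coordinates i.i.d. N(0, 1), last one fixed to 1 (bias)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] + [1.0] for _ in range(n)]


def sample_orthonormal_rows(k, dim, rng):
    """k orthonormal vectors in R^dim, via modified Gram-Schmidt on Gaussian vectors."""
    rows = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for u in rows:
            c = dot(u, v)
            v = [b - c * a for a, b in zip(u, v)]
        norm = math.sqrt(dot(v, v))
        rows.append([b / norm for b in v])
    return rows


def teacher_output(x, first_layer, second_layer):
    """Assumed form: y(x) = sum_j a_j * phi_teacher(<w_j, x>)."""
    return sum(a * phi_teacher(dot(w, x)) for a, w in zip(second_layer, first_layer))


rng = random.Random(0)          # hypothetical seed
d, n, n_teacher = 10, 100, 5    # values from the text
X = sample_covariates(n, d, rng)
W_teacher = sample_orthonormal_rows(n_teacher, d + 1, rng)
# Scalar second-layer weights with variance 1.8 (assumed reading of the text).
a_teacher = [rng.gauss(0.0, math.sqrt(1.8)) for _ in range(n_teacher)]
y = [teacher_output(x, W_teacher, a_teacher) for x in X]
```

The factor $\sqrt{24} = \sqrt{4!}$ is what makes the activation have unit second moment under a standard Gaussian input, since $\mathbb{E}[(z^4 - 6z^2 + 3)^2] = 105 - 180 + 126 - 36 + 9 = 24$.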
We performed 5 different runs, each corresponding to a different teacher network ($y$) and training dataset ($\rho$), and tested all the algorithms considered at each run. So the objective functional $G_\lambda$ is different for each run, which is why the values shown on the y-axis are offset by $G^*_\lambda$, the best value achieved by any of the algorithms considered for each run.
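The per-run offset described above amounts to subtracting, within each run, the smallest objective value reached by any algorithm. A minimal sketch of that bookkeeping follows; the function name, the algorithm labels, and the numbers are hypothetical and serve only as an illustration (they are not results from the paper), and taking the minimum over all recorded iterations is an assumed reading of "best value achieved".

```python
def offset_by_best(curves):
    """curves: dict mapping an algorithm name to its objective values along one run.

    Returns (best, shifted), where best is the minimum over all algorithms and
    iterations, and shifted holds the same curves with best subtracted.
    """
    best = min(min(values) for values in curves.values())
    shifted = {name: [v - best for v in values] for name, values in curves.items()}
    return best, shifted


# Made-up numbers, for illustration only.
run = {"algo_a": [1.0, 0.5, 0.3], "algo_b": [1.0, 0.75, 0.25]}
best, shifted = offset_by_best(run)
```

With this convention every shifted curve is nonnegative and at least one of them reaches zero in each run.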
For the algorithms using the bilevel formulation, we computed the values and the Wasserstein gradients of J λ explicitly by the formulas from Prop. F.1 and (F.2) (the matrix
For the algorithms using MFLD, we used $\beta^{-1} = 10^{-3}$. We ran the Euler-Maruyama discretization of the noisy particle gradient flow SDE described in Sec. 2 (with an inexact simulation of the Brownian increments described below), using $N = 1000$ particles (corresponding to the width of the student 2NN), and a step size of $10^{-2}$ for (1a) and $10^{-3}$ for (1b). For Wasserstein GF without noise, we used the same discretization but with $\beta^{-1} = 0$.
Concerning the initialization of the particles $(r^i, w^i)_{i \le N}$ (corresponding to the second- and first-layer weights of the student network, respectively), the $w^i_0$ are drawn i.i.d. uniformly on $\mathbb{S}^d$, and for the algorithms using the lifting formulation, the $r^i_0$ are drawn i.i.d. from $\mathcal{N}(0, 1)$. Note that our simulations of Brownian motion are not exact. To implement MFLD on $\mathbb{S}^d$, we simply took gradient steps in $\mathbb{R}^{d+1}$ with added Gaussian noise, and projected the weights back to the sphere.
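The initialization and the inexact sphere update described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the seed are hypothetical, the noise scale $\sqrt{2 \eta \beta^{-1}}$ (with $\eta$ the step size) is the usual Euler-Maruyama scaling for Langevin dynamics and may differ from the convention of Sec. 2, and the gradient is replaced by a zero placeholder since the formulas for the Wasserstein gradient are not reproduced here.

```python
import math
import random


def normalize(w):
    """Project a nonzero vector of R^{d+1} onto the unit sphere S^d."""
    norm = math.sqrt(sum(c * c for c in w))
    return [c / norm for c in w]


def sample_uniform_sphere(d, rng):
    """Uniform sample on S^d, obtained by normalizing a standard Gaussian in R^{d+1}."""
    return normalize([rng.gauss(0.0, 1.0) for _ in range(d + 1)])


def noisy_projected_step(w, grad, step, inv_beta, rng):
    """Gradient step in R^{d+1} with added Gaussian noise, then projection onto S^d.

    The noise scale sqrt(2 * step * inv_beta) is an assumption (standard Langevin
    scaling); inv_beta = 0 gives the noiseless update.
    """
    scale = math.sqrt(2.0 * step * inv_beta)
    moved = [wi - step * gi + scale * rng.gauss(0.0, 1.0) for wi, gi in zip(w, grad)]
    return normalize(moved)


rng = random.Random(0)   # hypothetical seed
d, N = 10, 1000          # values from the text
particles = [sample_uniform_sphere(d, rng) for _ in range(N)]
r_init = [rng.gauss(0.0, 1.0) for _ in range(N)]  # used by the lifting formulation only

# Zero placeholder standing in for the Wasserstein gradient of the objective.
zero_grad = [0.0] * (d + 1)
w_next = noisy_projected_step(particles[0], zero_grad, step=1e-2, inv_beta=1e-3, rng=rng)
w_same = noisy_projected_step(particles[0], zero_grad, step=1e-2, inv_beta=0.0, rng=rng)
```

Projecting after a Euclidean noisy step only approximates Brownian motion on the sphere, which is the inexactness mentioned in the text.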

Section: NeurIPS Paper Checklist


Section: Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: The claims made in the abstract and introduction match the paper's contributions. In particular the three bullet points concluding the introduction summarize the paper's contributions section by section.
Guidelines:
• The answer NA means that the abstract and introduction do not include the claims made in the paper.
• The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
• The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
• It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

Section: Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: The scope and limitations of each optimization dynamics considered are clearly discussed within each section.
Guidelines:
• The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.
• The authors are encouraged to create a separate "Limitations" section in their paper.
• The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
• The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
• The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
• The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
• If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
• While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

Section: Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: For each theorem or proposition or corollary or lemma, be it in the main text or in the appendix, the assumptions are clearly stated, and all proofs are provided.
Guidelines:
• The answer NA means that the paper does not include theoretical results.
• All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
• All assumptions should be clearly stated or referenced in the statement of any theorems.
• The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
• Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
• Theorems and Lemmas that the proof relies upon should be properly referenced.

Section: Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: The contributions of this work are theoretical. A numerical illustration is given in Fig. 1, for which the implementation details needed to reproduce the experiment are provided in Sec. F.4.
Guidelines:
• The answer NA means that the paper does not include experiments.
• If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
• If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
• Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
• While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. [...] with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

Section: Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: We provide in Sec. F.4 full details for the small numerical experiment of Fig. 1, which are sufficient to reproduce the experiment. The code we used will also be made public at a later date.
Guidelines:
• The answer NA means that paper does not include experiments requiring code.
• Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

Section: Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: The setup of the numerical experiment of Fig. 1 is very simple. Moreover, full details are provided in Sec. F.4.
Guidelines:
• The answer NA means that the paper does not include experiments.
• The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
• The full details can be provided either with the code, in appendix, or as supplemental material.

Section: Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: The purpose of the small experiment from Fig. 1 is to compare the qualitative behavior of several algorithms: advantage of MFLD over Wasserstein GF in Fig. 1a, and advantage of MFLD-Bilevel over MFLD-Lifting in Fig. 1b. This qualitative behavior is clear-cut across the 5 runs, all of which are shown.
Guidelines:
• The answer NA means that the paper does not include experiments.
Justification: The contributions of this work are theoretical.
Guidelines:
• The answer NA means that there is no societal impact of the work performed.
• If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
• Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
• The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
• The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
• If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

Section: Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: The contributions of this work are theoretical.
Guidelines:
• The answer NA means that the paper poses no such risks.
• Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
• Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
• We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

Section: Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA]
Justification: The paper does not use existing assets.
Guidelines:
• The answer NA means that the paper does not use existing assets.
• The authors should cite the original paper that produced the code package or dataset.
• The authors should state which version of the asset is used and, if possible, include a URL.


References:
[b0] Kendall Atkinson; Weimin Han (2012). Spherical harmonics and approximations on the unit sphere: an introduction. Springer Science & Business Media
[b1] Francis Bach (2017). Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research
[b2] Francis Bach (2019). The "η-trick" reloaded: multiple kernel learning. 
[b3] Francis Bach (2021). The quest for adaptivity. 
[b4] Dominique Bakry; Ivan Gentil; Michel Ledoux (2014). Analysis and geometry of Markov diffusion operators. Springer
[b5] Yoshua Bengio; Nicolas Le Roux; Pascal Vincent; Olivier Delalleau; Patrice Marcotte (2005). Convex neural networks. Advances in neural information processing systems
[b6] Raphaël Berthier; Andrea Montanari; Kangjie Zhou (2023). Learning time-scales in two-layers neural networks. 
[b7] Alberto Bietti; Joan Bruna; Loucas Pillaud-Vivien (2023). On learning Gaussian multi-index models with gradient flow. 
[b8] Nicolas Boumal (2023). An introduction to optimization on smooth manifolds. Cambridge University Press
[b9] Dmitri Burago; Yuri Burago; Sergei Ivanov (2001). A course in metric geometry. American Mathematical Society Providence
[b10] René Carmona; François Delarue (2013). Probabilistic analysis of mean-field games. SIAM Journal on Control and Optimization
[b11] Olivier Chapelle; Vladimir Vapnik; Olivier Bousquet; Sayan Mukherjee (2002). Choosing multiple parameters for support vector machines. Machine learning
[b12] Fan Chen; Yiqing Lin; Zhenjie Ren; Songbo Wang (2024). Uniform-in-time propagation of chaos for kinetic mean field Langevin dynamics. Electronic Journal of Probability
[b13] Fan Chen; Zhenjie Ren; Songbo Wang (2022). Uniform-in-time propagation of chaos for mean field langevin dynamics. 
[b14] Lénaïc Chizat (2017). Unbalanced optimal transport: Models, numerical methods, applications. 
[b15] Lénaïc Chizat (2022). Convergence rates of gradient methods for convex optimization in the space of measures. Open Journal of Mathematical Optimization
[b16] Lénaïc Chizat (2022). Mean-Field Langevin Dynamics: Exponential Convergence and Annealing. Transactions on Machine Learning Research
[b17] Lénaïc Chizat (2022). Sparse optimization on measures with over-parameterized gradient descent. Mathematical Programming
[b18] Lénaïc Chizat; Francis Bach (2018). On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in neural information processing systems
[b19] Lénaïc Chizat; Francis Bach (2020). Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. PMLR
[b20] Yohann De Castro; Fabrice Gamboa (2012). Exact reconstruction using Beurling minimal extrapolation. Journal of Mathematical Analysis and Applications
[b21] Quentin Denoyelle; Vincent Duval; Gabriel Peyré; Emmanuel Soubies (2019). The sliding Frank-Wolfe algorithm and its application to super-resolution microscopy. Inverse Problems
[b22] Christopher Frye; Costas J. Efthimiou (2012). Spherical harmonics in p dimensions. 
[b23] Qiang Fu; Ashia Wilson (2023). Mean-field Underdamped Langevin Dynamics and its Space-Time Discretization. 
[b24] Sébastien Gadat; Yohann De Castro; Clément Marteau (2023). FastPart: Over-Parameterized Stochastic Gradient Descent for Sparse optimisation on Measures. 
[b25] A Gray; L Vanhecke (1979). Riemannian geometry as determined by the volumes of small geodesic balls. Acta Mathematica
[b26] Charles Guille-Escuret; Manuela Girotti; Baptiste Goujaud; Ioannis Mitliagkas (2021). A study of condition numbers for first-order optimization. PMLR
[b27] Richard Holley; Daniel W Stroock (1986). Logarithmic Sobolev inequalities and stochastic Ising models. 
[b28] Kaitong Hu; Zhenjie Ren; David Šiška; Łukasz Szpruch (2021). Mean-field Langevin dynamics and energy landscape of neural networks. Annales de l'Institut Henri Poincaré (B) Probabilités et statistiques
[b29] Yunbum Kook; Matthew S. Zhang; Sinho Chewi; Murat A. Erdogdu (2024). Sampling from the Mean-Field Stationary Distribution. 
[b30] Daniel Lacker (2018). Mean field games and interacting particle systems. 
[b31] Gert R. G. Lanckriet; Nello Cristianini; Peter Bartlett; Laurent El Ghaoui; Michael I. Jordan (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research
[b32] John M. Lee (2018). Introduction to Riemannian manifolds. Springer
[b33] Mufan Li; Murat A. Erdogdu (2023). Riemannian Langevin algorithm for solving semidefinite programs. Bernoulli
[b34] Yuanzhi Li; Tengyu Ma; Hongyang R. Zhang (2020). Learning over-parametrized two-layer neural networks beyond NTK. PMLR
[b35] Matthias Liero; Alexander Mielke; Giuseppe Savaré (2018). Optimal entropy-transport problems and a new Hellinger-Kantorovich distance between positive measures. Inventiones mathematicae
[b36] Pierre Marion; Raphaël Berthier (2023). Leveraging the two-timescale regime to demonstrate convergence of neural networks. Advances in Neural Information Processing Systems
[b37] Song Mei; Andrea Montanari; Phan-Minh Nguyen (2018). A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences
[b38] Georg Menz; André Schlichting (2014). Poincaré and logarithmic Sobolev inequalities by decomposition of the energy landscape. The Annals of Probability
[b39] Ilya Molchanov; Sergei Zuyev (2004). Optimisation in space of measures and optimal design. ESAIM: Probability and Statistics
[b40] Atsushi Nitanda; Taiji Suzuki (2017). Stochastic particle gradient descent for infinite ensembles. 
[b41] Atsushi Nitanda; Denny Wu; Taiji Suzuki (2022). Convex analysis of the mean field Langevin dynamics. PMLR
[b42] Felix Otto; Cédric Villani (2000). Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. Journal of Functional Analysis
[b43] Clarice Poon; Gabriel Peyré (2021). Smooth bilevel programming for sparse regularization. Advances in Neural Information Processing Systems
[b44] Clarice Poon; Gabriel Peyré (2023). Smooth over-parameterized solvers for non-smooth structured optimization. Mathematical Programming
[b45] Alain Rakotomamonjy; Francis Bach; Stéphane Canu; Yves Grandvalet (2008). SimpleMKL. Journal of Machine Learning Research
[b46] Grant Rotskoff; Eric Vanden-Eijnden (2022). Trainability and accuracy of artificial neural networks: An interacting particle system approach. Communications on Pure and Applied Mathematics
[b47] Filippo Santambrogio (2015). Optimal transport for applied mathematicians. Birkhäuser
[b48] Justin Sirignano; Konstantinos Spiliopoulos (2020). Mean field analysis of neural networks: A central limit theorem. Stochastic Processes and their Applications
[b49] Taiji Suzuki; Denny Wu; Atsushi Nitanda (2023). Mean-field Langevin dynamics: Time-space discretization, stochastic gradient, and variance reduction. Advances in Neural Information Processing Systems
[b50] Taiji Suzuki; Denny Wu; Kazusato Oko; Atsushi Nitanda (2023). Feature learning via mean-field Langevin dynamics: classifying sparse parities and beyond. 
[b51] Alain-Sol Sznitman (1991). Topics in propagation of chaos. 
[b52] Shokichi Takakura; Taiji Suzuki (2024). Mean-field Analysis on Two-layer Neural Networks from a Kernel Perspective. 
[b53] Roman Vershynin (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press
[b54] Cédric Villani (2009). Optimal transport: old and new. Springer
[b55] Yuling Yan; Kaizheng Wang; Philippe Rigollet (2023). Learning gaussian mixtures using the Wasserstein-Fisher-Rao gradient flow. 

Figures:
Figure fig_0: 
Type: figure
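Caption: $h : \mathcal{P}_1(\Omega) \to \mathcal{M}(\mathcal{W})$ characterized by $\forall \varphi \in \mathcal{C}(\mathcal{W}, \mathbb{R})$, $\int_{\mathcal{W}} \varphi(w)\,(h\mu)(\mathrm{d}w) = \int_{\Omega} r\,\varphi(w)\,\mu(\mathrm{d}r, \mathrm{d}w)$, where $\mathcal{P}_p(\Omega)$ is the subset of $\mathcal{P}(\Omega)$ for which $\int |r|^p \,\mathrm{d}\mu(r, w) < +\infty$. For instance, it acts on discrete measures as $h\big(\frac{1}{m} \sum_{j=1}^m \delta_{(r_j, w_j)}\big) = \frac{1}{m} \sum_{j=1}^m r_j \delta_{w_j}$. We also define, for $b \in [1, 2]$ and $\mu \in \mathcal{P}_b(\Omega)$, $\Psi_b(\mu) := \big(\int_{\Omega} |r|^b \,\mathrm{d}\mu(r, w)\big)^{2/b}$. The objective functional of the lifted problem is then defined, for $\mu \in \mathcal{P}_b(\Omega)$, as $F_{\lambda,b}(\mu) := G(h\mu) + \frac{\lambda}{2} \Psi_b(\mu)$. (3.1) It is equivalent to minimize $G_\lambda$ or $F_{\lambda,b}$, as shown in the following statement. Proposition 3.1. Let $\nu \in \mathcal{M}(\mathcal{W})$. For any $\mu \in \mathcal{P}_b(\mathcal{W})$ such that $h\mu = \nu$, it holds $F_{\lambda,b}(\mu) \ge G_\lambda(\nu)$, and equality holds for $\mu(\mathrm{d}r, \mathrm{d}w) = \delta_{f(w)}(\mathrm{d}r)\,\frac{|\nu|(\mathrm{d}w)}{\|\nu\|_{TV}}$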
Data: 

Figure fig_1: 
Type: figure
Caption: [Chi22b, Sec. 5.1]. Remark 3.1. For functionals of the form $G_{\lambda,s}(\nu) = G(\nu) + \frac{\lambda}{s} \|\nu\|^s_{TV}$, instead of (1.1) which corresponds to $s = 2$, one can formulate a similar reduction by posing $\Psi_{b,s}(\mu) = \big(\int_{\Omega} |r|^b \,\mathrm{d}\mu(r, w)\big)^{s/b}$ and $F_{\lambda,b,s}(\mu) = G(h\mu) + \frac{\lambda}{s} \Psi_{b,s}(\mu)$. The statements of Prop. 3.1 and Prop. 3.2 hold true with $G_\lambda$ replaced by $G_{\lambda,s}$, and $F_{\lambda,b}$ by $F_{\lambda,b,s}$, for any $1 \le b \le s$, as can be shown by very simple adaptations of the proofs (only the second inequality in the proof of Lem. C.1, and the definition of $\lambda'$ in (C.
Data: 

Figure fig_2: 3
Type: figure
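Caption: It can be derived using the variational representation of the squared TV-norm [LCBGJ04; Bac19]: for any $\nu \in \mathcal{M}(\Omega)$, one has $\|\nu\|^2_{TV} = \min_{\eta \in \mathcal{P}(\mathcal{W})} \int_{\mathcal{W}} \frac{|\nu|^2}{\eta}$. By exchanging infima, it thus holds $\inf_{\nu \in \mathcal{M}(\mathcal{W})} G_\lambda(\nu) = \inf_{\eta \in \mathcal{P}(\mathcal{W}),\, \nu \in \mathcal{M}(\mathcal{W})} G(\nu) + \frac{\lambda}{2} \int_{\mathcal{W}} \frac{|\nu|^2}{\eta}$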
Data: 

Figure fig_3: 
Type: figure
Caption: A comparison of Wasserstein GF on the bilevel objective with (i.e. MFLD) and without noise.
Data: 

Figure fig_4: 
Type: figure
Caption: A comparison of MFLD applied to the Bilevel vs. Lifted formulations.
Data: 

Figure fig_5: 1
Type: figure
Caption: Figure 1: The regularized training loss $G_\lambda(\nu)$ (1.1) of a 2NN with the ReLU activation, learning a teacher 2NN with the 4th degree Hermite polynomial as its activation. In both plots, $d = 10$ and $\lambda = \beta^{-1} = 10^{-3}$. The implementation details are provided in Sec. F.4. Plots are averaged over 5 experiments. $G^*_\lambda$ is the best value achieved at each experiment. In Fig. (1b), "Conic" refers to using the metric (3.2) with $q_r = 1$, $q_w = 1$, while "Canonical" refers to the choice of $q_r = 2$, $q_w = 0$.
Data: 

Figure fig_6: 
Type: figure
Caption: D Details for Sec. 3.2 (reduction by bilevel optimization). D.1 Proof of Prop. 3.3
Data: 

Figure fig_7: 1
Type: figure
Caption: E.1 Proof of Prop. 4.1. We state and prove a more precise version of Prop. 4.1 below. Proposition E.2. Under Assumption 1, let $\Delta > 0$ and assume that $\Delta \le \frac{2 L_0 L_1 G(0)}{\lambda^2 J^*_\lambda}$. Then MFLD-Bilevel with the temperature schedule $\forall t,\ \beta_t = \frac{4d}{\Delta J^*_\lambda} \log \frac{4 C_B^{1/d}}{\Delta J^*_\lambda}$ converges to $(1 + \Delta)$-multiplicative accuracy in time
Data: 

Figure fig_8: 1
Type: figure
Caption: F.2. 11Lyapunov function analysis for bounding the LSI constant of δ v ∝ e -βJ ′ λ[δv]  τ Observe that by the assumption g ′ ≥ c 1 > 0 of Thm. 5.2, J ′ λ [δ v ] = -λg(⟨v, •⟩) has a unique global minimum at v. Moreover, our other assumptions on g will imply that the Riemannian Hessian at optimum∇ 2 J ′ λ [δ v ](v)is positive definite. This motivates us to follow the strategy of [LE23, Thm. 3.4] for proving LSI for δ v ∝ e -βJ ′ λ [δv] τ . Let us first outline the strategy and recall some useful classical notions. The generator of the Langevin diffusion with invariant measure exp(-βf )τ /Z is L = ∆ -β⟨∇f, ∇⟩.(F.3) Define U = {w : dist W (w, v) ≤ r} for some v ∈ S d , with r > 0 to be chosen later. We sayW : S d → [1, ∞) is a Lyapunov function if LW W ≤ -θ + b1 U ,for constants θ > 0 and b ≥ 0. When proving functional inequalities for a Gibbs measure exp(-βf )τ /Z, a typical choice of Lyapunov function is W = exp(β(fmin f )/2), for which the Lyapunov condition writes say a probability measure ν ∈ P(S d ) satisfies a local Poincaré inequality on U with constant κ U if U f 2 dν ≤ 1 κ U U ∥∇f ∥ 2 dν, for all smooth f : U → R such that U f dν = 0. Notice that U has a convex boundary, thus we can use the Bakry-Émery criterion as adapted to manifolds with convex boundaries by [LE23, Proposition B.11] to prove a local Poincaré inequality on U . Specifically, it suffices to have inf w∈U λ min (∇ 2 f (w)) > 0.
Data: 

Figure fig_9: 
Type: figure
Caption: Proof. By the Lyapunov criterion for Poincaré inequality [BGL14, Thm. 4.6.2], if the generator $L$ given by (F.3) satisfies the Lyapunov condition $\frac{LW}{W} \le -\theta + b \mathbf{1}_U$ for some $\theta > 0$, $b \ge 0$, $U \subset \mathbb{S}^d$ and $W : \mathbb{S}^d \to \mathbb{R}$, and if $\nu$ satisfies a local Poincaré inequality on $U$ with constant $\kappa_U$, then $\nu$ satisfies a Poincaré inequality on $\mathbb{S}^d$ with constant $\kappa \ge \frac{\theta}{1 + b / \kappa_U}$. Let us apply this to $W = \exp(\beta (f - \min f)/2)$. By $(L_{\mathbb{S}^d})$ and $(L_U)$, the Lyapunov condition holds with $\theta = D_2 \beta^2 \lambda^2$ and $b = D_1 \lambda \beta (d - 1) + D_2 \beta^2 \lambda^2$. Moreover, since $U$ has a convex boundary (the geodesic in $\mathbb{S}^d$ between any two points in $U$ remains in $U$ for $r < \pi/2$), by [LE23, Proposition B.11] $\nu$ satisfies a local Poincaré inequality on $U$ with constant $\kappa_U \ge \mathrm{Ric}_g + \beta \lambda_{\min}(\nabla^2 f(w)) \ge d - 1 + \beta \lambda D_4$
Data: 

Figure fig_10: 
Type: figure
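Caption: satisfies the conditions of Thm. F.6 with $D_0, \dots, D_4, r$ dependent only on $c_1, C_1, C_2, C_3$. Proof. The Riemannian gradient and Hessian of $f_0 = J'_\lambda[\delta_v] = -\lambda g(\langle v, \cdot \rangle)$ are given by $\nabla f_0(w) = -\lambda g'(\langle w, v \rangle) \Pi_w v$ and $\nabla^2 f_0(w) = -\lambda \Pi_w \big( g''(\langle w, v \rangle) v v^\top - g'(\langle w, v \rangle) \langle w, v \rangle I_{d+1} \big) \Pi_w$ where $\Pi_w = I_{d+1} - w w^\top : \mathbb{R}^{d+1} \to T_w \mathbb{S}^d = \{w\}^\perp$ for any $w \in \mathbb{S}^d$. This can be shown by considering the smooth extension of $f_0$ to $\mathbb{R}^{d+1} \to \mathbb{R}$ defined by $x \mapsto -\lambda g(\langle v, x \rangle)$ and using that $\mathbb{S}^d$ is a Riemannian submanifold of $\mathbb{R}^{d+1}$ [Bou23, Chap. 5]. In particular since $v^\top \Pi_w \Pi_w v = 1 - \langle w, v \rangle^2$ and $\operatorname{Tr} \Pi_w = d$,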
Data: 

Figure fig_11: 
Type: figure
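Caption: Finally, let us verify $(C_U)$. Indeed, for any $w \in U$, $\lambda_{\min}(\nabla^2 f_0(w)) = \min_{\|u\|_2 = 1,\, \langle u, w \rangle = 0} \big\{ -\lambda g''(\langle w, v \rangle) \langle u, v \rangle^2 + \lambda g'(\langle w, v \rangle) \langle w, v \rangle \big\} \ge -\lambda |g''(\langle w, v \rangle)| \max_{\|u\|_2 = 1,\, \langle u, w \rangle = 0} \langle u, v \rangle^2 + \lambda c_1 \langle w, v \rangle = -\lambda |g''(\langle w, v \rangle)| (1 - \langle w, v \rangle^2) + \lambda c_1 \langle w, v \rangle$,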
Data: 

Figure fig_12: 3
Type: figure
Caption: F.3 Proof of Prop. 5.3 (examples of activations satisfying the assumptions). Before presenting the proof, we recall a few concepts from the theory of spherical harmonics, and refer to [AH12; FE12] for more details. Let $\tau$ be the uniform probability measure on $\mathbb{S}^d$. The spherical harmonics in dimension $d + 1$ form an orthonormal basis of $L^2(\tau)$. We denote them by $\{Y_{kj}\}_{k,j}$, where $k \ge 0$ and $1 \le j \le N(d, k)$, where $N(d, 0) = 1$ and $N(d, k) = \frac{2k + d - 1}{k} \binom{k + d - 2}{d - 1}$ for $k \ge 1$ (for $k = 0$ we have $Y_{01} = 1$). Consequently, any $\phi \in L^2(\tau)$ can be written as $\phi = \sum_{k=0}^{\infty} \sum_{j=1}^{N(d,k)} \langle \phi, Y_{kj} \rangle_{L^2(\tau)} Y_{kj}$. Let $P_{k,d}$ be the Legendre polynomial (a.k.a. Gegenbauer polynomial) of degree $k$ in dimension $d + 1$, normalized such that $P_{k,d}(1) = 1$. Thanks to Rodrigues' formula [AH12, Theorem 2.23], we can express Legendre polynomials as,
Data: 

Figure fig_13: 
Type: figure
Caption: denotes the $j$th derivative of $P_{k,d}$, and $c_{j,k,d} = \frac{k(k-1) \cdots (k-j+1)\,(k+d-1)(k+d) \cdots (k+d+j-2)}{d(d+2) \cdots (d+2j-2)}$. (F.8) Notice that for $j > k$ we have $P^{(j)}_{k,d} = 0$.
Data: 

Figure fig_14: 
Type: figure
Caption: N⟨φ r (⟨w, •⟩), Y kj (•)⟩ L 2 (τ ) = ᾱk,r Y kj (w) := α k,r N (d, k) Y kj (w), )P k (t)(1t 2 ) (d-2)/2 dt.Then, by the expansion of φ r (⟨w, •⟩) in the basis of spherical harmonics,φ r (⟨w, •⟩) = ∞ k=0 N (d,k) j=1 α k,r N (d, k) Y kj (w)Y kj (•) = ∞ k=0 N (d, k)α k,r P k,d (⟨w, •⟩). (F.10)Via the formula for inner products of Legendre polynomials, we obtainq r (⟨w, v⟩) = ∞ k=0 α 2 k,r N (d, k)⟨P k,d (⟨w, •), P k,d (⟨v, •)⟩ L 2 (τ ) = ∞ k=0 α 2 k,r P k,d (⟨w, v⟩). c j,k,d P k-j,d+2j (⟨w, v⟩), (F.11)where c j,k,d is given by (F.8). On the other hand, we can directly obtain from (F.10),φ (j) r (⟨w, x⟩) = ∞ k=0 (d, k)α k,r P (j) k,d (⟨w, x⟩) = ∞ k=j N (d, k)α k,r c j,k,d P k-j,d+2j (⟨w, x⟩).Therefore,⟨φ (j) r (⟨w, •⟩), φ (j) r (⟨v, •⟩)⟩ L 2 (τ ) = ∞ k=j α 2 k,r c 2 j,k,d N (d, k) N (d + 2j, kj) P k-j,d+2j (⟨w, v⟩).Moreover, it is straightforward to verify thatc j,k,d N (d, k) N (d + 2j, kj) = (d + 1)(d + 3) . . . (d + 2j -1)for k ≥ j. Therefore,⟨φ (j) r (⟨w, •⟩), φ (j) r (⟨v, •⟩)⟩ L 2 (τ ) = (d + 1)(d + 3) . . . (d + 2j -1)∞ k=j α 2 k,r c j,k,d P k-j,d+2j (⟨w, v⟩) = (d + 1)(d + 3) . . . (d + 2j -1)q (j) r (⟨w, v⟩),
Data: 

Figure fig_15: 22
Type: figure
Caption: 2 H ) 2 .22∥x∥ (⟨w, v⟩) = E ∥x∥ ∥x∥ 2j (d + 1)(d + 3) . . . (d + 2j -1) φ (j) (∥x∥ ⟨w, x⟩)φ (j) (∥x∥ ⟨v, x⟩)dτ (x) = ∥x∥ 2j (d + 1)(d + 3) . . . (d + 2j -1) φ (j) (⟨w, x⟩)φ (j) (⟨v, x⟩)dρ(x), which concludes the proof.We are now ready to state the proof of Prop. 5.3Proof of Prop. 5.3. Recall g(⟨w, v⟩) = ⟨ϕ(w),ϕ(v)⟩ 2 H 2(λ+∥ϕ(v)∥ 2 H ) 2 . Let q(⟨w, v⟩) = ⟨ϕ(w), ϕ(v)⟩ H . Consequently, = qq ′′ + q ′ 2 (λ + ∥ϕ(v)∥ 2 H ) 2 , g ′′′ = 3q ′ q ′′ + qq ′′′ (λ + ∥ϕ(v)∥We proceed to bound each term separately. By non-negativity of ϕ, for any r > 0, we have q(⟨w, v⟩) = E [φ(⟨w, x⟩)φ(⟨v, x⟩)] ≥ E φ(⟨w, x⟩)ϕ(⟨v, x⟩)1 |⟨w, x⟩| ≤ r, |⟨v, x⟩| ≤ r ≥ ( inf |z|≤r φ(z)) 2 P [{|⟨w, x⟩| ≤ r} ∩ {|⟨v, x⟩| ≤ r}] ≥ ( inf |z|≤r φ(z)) 2 1 -P[⟨w, x⟩ 2 > r 2 ] -P[⟨v, x⟩ 2 > r 2 ] 1)r 2 ,where the last inequality follows from Markov inequality along with the fact thatE[xx ⊤ ] = E[∥x∥ 2 ]d+1 I d+1 for spherically symmetric distributions. Thus, by choosing r = m = 2b2 √ b2 b1
Data: 

Figure fig_16: 222
Type: figure
Caption: ) 2 . 2 L 2222Furthermore, by Lem. F.13 and the Cauchy-Schwartz inequality, (ρ) ) 2 , and|g ′′′ | ≤ 3b 3 2 ∥ϕ ′ ∥ 2 L 4 (ρ) ∥ϕ ′′ ∥ 2 L 4 (ρ) + b 3 2 ∥ϕ∥ 2 L 2 (ρ) ∥ϕ ′′′ ∥ 2 L 4 (ρ) (λ + ∥φ∥ 2 L 2 (ρ) ) 2
Data: 

Figure tab_2: 
Type: table
Caption: ⟨w, x⟩)φ ′ (⟨v, x⟩) | ∥x∥]
Data: ≥ ≥ ≥(inf |z|≤r φ ′ (z)) 2 d + 1 (inf |z|≤r φ ′ (z)) 2 d + 1 (inf |z|≤r φ ′ (z)) 2 d + 1E ∥x∥ E ∥x∥ 2 2 1 -P ⟨w, x⟩ 2 > P |⟨w, x⟩| ≤ r ∥x∥ E ∥x∥ 2 1 -2 2 ∥x∥ r 2 (d + 1)∩ |⟨v, x⟩| ≤ r 2 ∥x∥ 2 | ∥x∥ -P ⟨v, x⟩ 2 > r | ∥x∥ ∥x∥ .r 2 ∥x∥ 2 | ∥x∥√ b1 Consequently, by choosing r = m = 2b2b2

Figure tab_3: 
Type: table
Caption:
• It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
• For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).
• If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.
8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: The very small scale of the numerical experiment of Fig. 1 means that any standard laptop or desktop computer can be used to reproduce it, with a runtime of a few minutes.
Guidelines:
• The answer NA means that the paper does not include experiments.
• The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.
• The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
• The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).
9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: We have reviewed the Code of Ethics and have not found any deviation of our work from it.
Guidelines:
• The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
• If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
• The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).
10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Data: 


Formulas:
Formula formula_0: $\min_{\nu \in \mathcal{M}(\mathcal{W})} G_\lambda(\nu), \qquad G_\lambda(\nu) := G(\nu) + \frac{\lambda}{2} \|\nu\|_{TV}^2,$ (1.1)

Formula formula_1: $\min_{\mu \in \mathcal{P}(\mathcal{W}')} F_\beta(\mu), \qquad F_\beta(\mu) := F(\mu) + \beta^{-1} H(\mu),$ (1.3)

Formula formula_2: s∈TωΩ ∥s∥ ω ≤1 ∇ 2 F ′ [µ](s, s) ≤ L, and ∀µ, µ ′ ∈ P 2 (Ω), ∀ω ∈ Ω, ∥∇F ′ [µ] -∇F ′ [µ ′ ]∥ ω ≤ L W 2 (µ, µ ′ ),

Formula formula_3: ∀µ ′ ∈ P(Ω), H (µ ′ |μ) ≤ 1 2α I(µ ′ |μ),

Formula formula_4: H (µ ′ |μ) := Ω log dµ ′ dμ dµ ′ , I(µ ′ |μ) := Ω ∇ log dµ ′ dμ (ω) 2 ω dµ ′ (ω),

Formula formula_5: $\beta^{-1} H(\mu_t | \mu^*_\beta) \le F_\beta(\mu_t) - F_\beta(\mu^*_\beta) \le \exp(-2\beta^{-1}\alpha t)\,\big(F_\beta(\mu_0) - F_\beta(\mu^*_\beta)\big).$

Formula formula_6: $\min_{\mu \in \mathcal{P}_b(\Omega)} F_{\lambda,b}(\mu) = \min_{\nu \in \mathcal{M}(\mathcal{W})} G_\lambda(\nu).$

Formula formula_7: δr 1 δw 1 , δr 2 δw 2 (r,w) = Γ -1 |r| qr δr 1 δr 2 r 2 + |r| qw ⟨δw 1 , δw 2 ⟩ w . (3.2)

Formula formula_8: J λ (η) := inf ν∈M(W) G(ν) + λ 2 W |ν| 2 η . (3

Formula formula_9: P(W) J λ = inf M(W) G λ . Moreover, if G λ admits a minimizer ν ∈ M(W), then arg min J λ = |ν| ∥ν∥ T V , ν ∈ arg min G λ .

Formula formula_10: $\min_{\eta \in \mathcal{P}(\mathcal{W})} \min_{f \in L^2(\eta)} G(f\eta) + \frac{\lambda}{2} \int_{\mathcal{W}} f(w)^2\, d\eta(w).$

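Formula formula_11: $dr^i_t = -\Gamma\, \nabla_{r^i} F'_{\lambda,2}[\mu_t](r^i_t, w^i_t)\,dt = -\Gamma\,\big(G'[\nu_t](w^i_t) + \lambda r^i_t\big)\,dt$ (3.4), $\quad dw^i_t = -\nabla_{w^i} F'_{\lambda,2}[\mu_t](r^i_t, w^i_t)\,dt + \sqrt{2\beta^{-1}}\, dB^i_t = -r^i_t\, \nabla G'[\nu_t](w^i_t)\,dt + \sqrt{2\beta^{-1}}\, dB^i_t$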

Formula formula_12: $\mu_t = \frac{1}{N} \sum_{i=1}^N \delta_{(r^i_t, w^i_t)}$ and $\nu_t = \frac{1}{N} \sum_{i=1}^N r^i_t\, \delta_{w^i_t}$, and taking $\eta_t = \frac{1}{N} \sum_{i=1}^N \delta_{w^i_t}$.

Formula formula_13: L i , B i < ∞ such that ∇ i G ′′ [ν](w, w ′ ) w ≤ L i and ∇ i G ′ [ν] w ≤ L i ∥ν∥ T V + B i for all ν ∈ M(W) and w, w ′ ∈ W. Moreover there exists L 2 < ∞ such that ∥∇ w ∇ w ′ G ′′ [ν](w, w ′ )∥ ≤ L 2

Formula formula_14: η ∈ P(W), J λ + β -1 H satisfies local LSI at η with the constant α η = α τ exp -1 λ L 0 βJ λ (η) . Further, J λ + β -1 H satisfies α-LSI uniformly along the MFLD trajectory (η t ) t with the constant α = α τ exp -1 λ L 0 β min G(0), J λ (η 0 ) + β -1 H (η 0 |τ ) .

Formula formula_15: $\partial_t \eta_t = \mathrm{div}(\eta_t \nabla J'_\lambda[\eta_t]) + \beta_t^{-1} \Delta \eta_t$ satisfies $J_\lambda(\eta_t) - \inf J_\lambda = O\!\left(\frac{\log\log t}{\log t}\right)$

Formula formula_16: ∆ = 0.01), if J λ (η T∆ ) ≤ (1 + ∆)J * λ .

Formula formula_17: β t = 4d ∆J * λ log CB ∆J * λ converges to (1 + ∆)-multiplicative accuracy in time T ∆ ≤ C ′ ∆J * λ log CB ∆J * λ • exp C ′ L 0 G(0) λ ∆J * λ log CB ∆J * λ • log 2G(0) ∆J * λ + C ′ H (η 0 |τ )

Formula formula_18: t k+1 -t k = C 1 2 k k • exp L 0 d λ C 3 ∆ log B ∆J * λ + C 2 ,

Formula formula_19: T ∆ ≤ t K+1 ≤ C 4 ∆J * λ log B ∆J * λ 2 • exp L 0 d λ C 3 ∆ log B ∆J * λ + C 2 .

Formula formula_20: J * λ = inf G + λ 2 ∥•∥ 2 T V = ∥ν0∥ 2 T V

Formula formula_21: J λ,β (η t ) -inf J λ,β ≤ (J λ,β (η 0 ) -inf J λ,β ) e -(α * β β -1 -ε)t .

Formula formula_22: $G(\nu) = \frac{1}{2}\, \mathbb{E}_{x \sim \rho} |\hat{y}_\nu(x) - y(x)|^2$ where $\hat{y}_\nu(x) = \int_{\mathcal{W}} \varphi(\langle w, x\rangle)\, d\nu(w).$

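Formula formula_23: $J_\lambda(\eta) = \frac{\lambda}{2} \langle y, (K_\eta + \lambda\, \mathrm{id})^{-1} y\rangle_{L^2_\rho}, \qquad J'_\lambda[\eta](w) = -\frac{\lambda}{2} \langle \varphi(\langle w, \cdot\rangle), (K_\eta + \lambda\, \mathrm{id})^{-1} y\rangle^2_{L^2_\rho}$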

Formula formula_24: [-1, +1] → R + such that J ′ λ [δ v ](w) = -λg(⟨w, v⟩)

Formula formula_25: $c_1 \le g'(r) \le C_1, \quad g''(r) \ge -C_2, \quad g''(r)(1 - r^2)^{1/2} \le C_3, \quad g'''(r)(1 - r^2)^{3/2} \le C_4.$

Formula formula_26: β ≥ D 0 dλ -1 , δ v ∝ e -βJ ′ λ [δv] τ satisfies α v -LSI. Furthermore, if additionally 1 d 2 E x∼ρ ∥x∥ 4 , φ (i) L 4 (ρ) < ∞ for i ∈ {0, 1, 2} where ∥φ∥ p L p (ρ) := |φ(⟨w, x⟩)| p dρ(x) (independent

Formula formula_27: J ′ λ [η λ,β ].

Formula formula_28: 1 (d + 1) ≤ E[∥x∥ 2 ] ≤ E[∥x∥ 12 ] 1/6 ≤ b 2 (d + 1) for constants b 1 , b 2 > 0. Let m := 2b 3/2

Formula formula_29: A.1 Using ∥•∥ 2 T V vs. ∥•∥ T V

Formula formula_30: $\min_{\nu \in \mathcal{M}(\mathcal{W})} G_\lambda(\nu), \qquad G_\lambda(\nu) := G(\nu) + \frac{\lambda}{2} \|\nu\|_{TV}^2.$

Formula formula_31: min ν∈M(W) G λ(ν), G λ(ν) := G(ν) + λ ∥ν∥ T V ,

Formula formula_32: {0} ∪ λ≥0 arg min G λ = {0} ∪ λ≥0 arg min G λ

Formula formula_33: λ≥0 arg min G λ = ν ∈ M(W); ∀w, G ′ [ν](w) + λ ∥ν∥ T V ν(dw) |ν(dw)| = 0, λ ∈ R + λ≥0 arg min G λ = ν ∈ M(W); ∀w, G ′ [ν](w) + λ ν(dw) |ν(dw)| = 0, λ ∈ R + .

Formula formula_34: $x \mapsto \frac{1}{N} \sum_{i=1}^N a_i\, \varphi(w_i^\top x).$

Formula formula_35: L(µ) = R R×W aϕ(w)dµ(a, w) = R W ϕ(w)d[hµ](w) .

Formula formula_36: G(ν) = R(Φν) = R W ϕ(w)dν(w) .

Formula formula_37: F(µ) = R R×W aϕ(w)dµ(a, w) + λ 2 R×W a 2 dµ(a, w) + 1 2σ 2 R×W ∥w∥ 2 dµ(a, w) (A.1) = G(hµ) + λ 2 R×W a 2 dµ(a, w) + 1 2σ 2 R×W ∥w∥ 2 dµ(a, w).

Formula formula_38: F(µ) = F λ,2 (µ) + 1 2σ 2 R×W ∥w∥ 2 dµ(a, w).

Formula formula_39: P(W) → R such that G(η) = inf f :W→R F δ f (w) (dr)η(dw) , corresponding precisely to G(η) = J λ (η) + 1 2σ 2 W ∥w∥ 2 dη(w)

Formula formula_40: ∂ t η t = β -1 ∆η t + div (η t ∇G ′ [η t ]) = β -1 ∆η t + div η t ∇J ′ λ [η t ] + 1 σ 2 w (A.2)

Formula formula_41: J λ,β,σ = J λ + β -1 H + 1 2σ 2 W ∥w∥ 2 dη(w) = J λ + β -1 H • β -1/2 σγ where β -1/2 σγ := N (0, β -1 σ 2 I d ),

Formula formula_42: J λ,β = J λ +β -1 H (•|τ ) = J λ +β -1 H +cst.

Formula formula_43: F + β -1 H • βσ 2 γ .

Formula formula_44: Proposition B.1. Suppose F : P 2 ((Ω, g)) → R is twice differentiable in the Wasserstein sense. Let 0 ≤ L < ∞. Suppose that F satisfies (P1), i.e., ∀µ ∈ P 2 (Ω), ∀ω ∈ Ω, max s∈TωΩ ∥s∥ ω ≤1 ∇ 2 F ′ [µ](s, s) ≤ L and ∀µ, µ ′ ∈ P 2 (Ω), ∀ω ∈ Ω, ∥∇F ′ [µ] -∇F ′ [µ ′ ]∥ ω ≤ L W 2 (µ, µ ′ )

Formula formula_45: ∂ t µ t = -div(∇ϕ t µ t ) ∂ t ϕ t = -1 2 ∥∇ϕ t ∥ 2 and dµ t ∥∇ϕ t ∥ 2 = W 2 2 (µ 0 , µ 1 ) = 1 for all t.

Formula formula_46: f ′ (t) = d dt F (µ t ) = dµ t ⟨∇F ′ [µ t ], ∇ϕ t ⟩ f ′′ (t) = d(∂ t µ t ) ⟨∇F ′ [µ t ], ∇ϕ t ⟩ + dµ t d dt ∇F ′ [µ t ], d dt ∇ϕ t = dµ t ∇ ⟨∇F ′ [µ t ], ∇ϕ t ⟩ , ∇ϕ t + dµ t ∇F ′ [µ t ], d dt ∇ϕ t + d dt ∇F ′ [µ t ], ∇ϕ t = dµ t ∇ 2 F ′ [µ t ](∇ϕ t , ∇ϕ t ) + dµ t ∇ 2 ϕ t (∇F ′ [µ t ], ∇ϕ t ) + dµ t ⟨∇F ′ [µ t ], ∇∂ t ϕ t ⟩ + dµ t d dt ∇F ′ [µ t ], ∇ϕ t .

Formula formula_47: dµ t ∇ 2 F ′ [µ t ](∇ϕ t , ∇ϕ t ) = dµ t ∥∇ϕ t ∥ 2 ∇ 2 F ′ [µ t ](s t , s t ) ≤ L • dµ t ∥∇ϕ t ∥ 2 = L.

Formula formula_48: ∂ t ϕ t = -1 2 ∥∇ϕ t ∥ 2 .

Formula formula_49: dµ t d dt ∇F ′ [µ t ], ∇ϕ t ≤ dµ t ∥∇ϕ t ∥ • sup t∈[0,1] sup ω∈Ω d dt ∇F ′ [µ t ](ω) since dµ t ∥∇ϕ t ∥ 2 ≤ dµ t ∥∇ϕ t ∥ 2 = 1.

Formula formula_50: µ ′ = µ s , ∥∇F ′ [µ s ](ω) -∇F ′ [µ t ](ω)∥ ω s -t ≤ L W 2 (µ s , µ t ) s -t = L since (µ t ) t is a constant-speed geodesic with W 2 (µ 0 , µ 1 ) = 1. So by letting s → t we obtain that d dt ∇F ′ [µ t ](ω) ≤ L for all t ∈ [0, 1], ω ∈ Ω. Thus we have shown |f ′′ (t)| ≤ 2L

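Formula formula_51: $\forall \varphi \in C(\mathcal{W}, \mathbb{R}), \quad \int_{\mathcal{W}} \varphi(w)\,(h^p\mu)(dw) = \int_\Omega \mathrm{sign}(r)\, |r|^p\, \varphi(w)\, \mu(dr, dw).$

Formula formula_52: $h^p\Big(\frac{1}{m} \sum_{j=1}^m \delta_{(r_j, w_j)}\Big) = \frac{1}{m} \sum_{j=1}^m \mathrm{sign}(r_j)\, |r_j|^p\, \delta_{w_j}.$ Lemma C.1. For $b \in [1, 2]$ and $p > 0$, let $\Psi_{b,p} : \mathcal{P}(\Omega) \to \mathbb{R} \cup \{+\infty\}$ defined by $\Psi_{b,p}(\mu) := \Big(\int_\Omega |r|^{pb}\, d\mu(r, w)\Big)^{2/b}$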

Formula formula_53: min µ s.t. h p µ=ν Ψ b,p (µ) = ∥ν∥ 2 T V . Moreover, if b = 1 then the set of minimizers is {µ ∈ P(W); h p µ = ν and ∀w, supp(µ(•|w)) ⊂ R + or supp(µ(•|w)) ⊂ R -} , and if b > 1 there is a unique minimizer which is δ f (w) (dr) |ν|(dw) ∥ν∥ T V where f (w) = ∥ν∥ 1/p T V dν d|ν| (w).

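Formula formula_54: $\|h^p\mu\|_{TV} = \max_{\phi : \mathcal{W} \to [-1,1]} \int_\Omega \mathrm{sign}(r)\, |r|^p\, \phi(w)\, d\mu(r, w) \le \int_\Omega |r|^p\, d\mu(r, w)$ so $\|\nu\|_{TV}^2 = \|h^p\mu\|_{TV}^2 \le \Big(\Big(\int_\Omega |r|^p\, d\mu(r, w)\Big)^b\Big)^{2/b} \le \Big(\int_\Omega |r|^{pb}\, d\mu(r, w)\Big)^{2/b} = \Psi_{b,p}(\mu),$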
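The inequality $\|h^p\mu\|_{TV}^2 \le \Psi_{b,p}(\mu)$ of Lemma C.1 can be checked numerically on a discrete measure. The following is a minimal sketch, assuming an atomic $\mu = \sum_j m_j \delta_{(r_j, w_j)}$ with pairwise distinct $w_j$ (so that $\|h^p\mu\|_{TV} = \sum_j m_j |r_j|^p$); the function names `tv_norm_of_projection` and `psi` are illustrative and not from the paper.

```python
import numpy as np

def tv_norm_of_projection(weights, r, p):
    """||h^p mu||_TV for mu = sum_j weights[j] * delta_{(r_j, w_j)} with distinct w_j."""
    return float(np.sum(weights * np.abs(r) ** p))

def psi(weights, r, b, p):
    """Psi_{b,p}(mu) = (int |r|^{pb} dmu)^{2/b} for the same atomic measure."""
    return float(np.sum(weights * np.abs(r) ** (p * b)) ** (2.0 / b))

rng = np.random.default_rng(0)
m, p = 50, 1.5
weights = np.full(m, 1.0 / m)   # uniform weights on the atoms
r = rng.normal(size=m)          # signed "magnitudes" r_j

tv = tv_norm_of_projection(weights, r, p)
# Gap Psi_{b,p}(mu) - ||h^p mu||_TV^2 for several exponents b in [1, 2].
gap = {b: psi(weights, r, b, p) - tv ** 2 for b in (1.0, 1.5, 2.0)}

# Candidate minimizer for b > 1: reweight the atoms by |nu| / ||nu||_TV and
# give every atom the constant magnitude ||nu||_TV^{1/p} (with the sign of r_j).
weights_opt = weights * np.abs(r) ** p / tv
r_opt = np.sign(r) * tv ** (1.0 / p)
psi_opt = psi(weights_opt, r_opt, 2.0, p)
```

In this discrete setting the gap vanishes for $b = 1$ and is strictly positive for $b > 1$ unless $|r_j|$ is constant, which matches the uniqueness statement for $b > 1$.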

Formula formula_55: 2 T V is attained by letting µ(dr, dw) = δ f (w) (dr) |ν|(dw) ∥ν∥ T V

Formula formula_56: 1/p T V dν d|ν| (w). This proves that min µ:h p µ=ν Ψ b,p (µ) = ∥ν∥ 2 T V .

Formula formula_57: (•|w)) ⊂ R + or supp(µ(•|w)) ⊂ R -}.

Formula formula_58: dw) = δ f (w) (dr) |ν|(dw) ∥ν∥ T V

Formula formula_59: µ∈P(Ω) F λ,b,p (µ) where F λ,b,p (µ) = G(h p µ) + λ 2 Ψ b,p (µ). (C.1)

Formula formula_60: Then min P(Ω) F λ,b,p = min M(W) G λ . Moreover, if b > 1 then arg min F = δ ∥ν∥ 1/p T V dν d|ν| (w) (dr) ν(dw) ∥ν∥ T V

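Formula formula_61: $\arg\min F = \{\mu;\ h^p\mu \in \arg\min G \text{ and } \forall w,\ \mathrm{supp}(\mu(\cdot|w)) \subset \mathbb{R}_+ \text{ or } \mathrm{supp}(\mu(\cdot|w)) \subset \mathbb{R}_-\}$. Furthermore, $F$ is convex.

Formula formula_62: $\min_{\mu \in \mathcal{P}(\Omega)} F(\mu) = \min_{\mu \in \mathcal{P}(\Omega)} G(h^p\mu) + \frac{\lambda}{2} \Psi_{b,p}(\mu) = \min_{\nu \in \mathcal{M}(\mathcal{W})}\ \min_{\mu \in \mathcal{P}(\Omega) : h^p\mu = \nu} G(h^p\mu) + \frac{\lambda}{2} \Psi_{b,p}(\mu) = \min_{\nu \in \mathcal{M}(\mathcal{W})} G(\nu) + \frac{\lambda}{2} \min_{\mu \in \mathcal{P}(\Omega) : h^p\mu = \nu} \Psi_{b,p}(\mu) = \min_{\nu \in \mathcal{M}(\mathcal{W})} G(\nu) + \frac{\lambda}{2} \|\nu\|_{TV}^2 = \min_{\nu \in \mathcal{M}(\mathcal{W})} G_\lambda(\nu)$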

Formula formula_63: δr 1 δw 1 , δr 2 δw 2 (r,w) = Γ -1 |r| qr δr 1 δr 2 r 2 +|r| qw ⟨δw 1 , δw 2 ⟩ w , i.e., g (r,w) = Γ -1 |r| qr-2 0 0 |r| qw g w .

Formula formula_64: ) = Γ -1 |r| qr-2 0 0 |r| qw g w on Ω * = R * × W.

Formula formula_65: T α : Ω * , g [qr,qw,Γ] → Ω * , g [ qr α , qw α ,α 2 Γ] defined by T α (r, w) = (sign(r) |r| α , w) is an isometry. Proof. Since Ω * is a disjoint manifold: Ω * = R * + ×W ∪ R * -×W, and since T α (R * + ×W) = R * + ×W, it suffices to check that the restricted map T + α : R * + × W, g [qr,qw,Γ] → R * + × W, g [ qr α , qw α ,α 2 Γ]

Formula formula_66: = T + α (r, w) = (r α , w), so da a = α dr r , δr 1 δw 1 • g (r,w) δr 2 δw 2 = δa 1 δw 1 • g(a,w) δa 2 δw 2 = αa 1 r δr 1 δw 1 • g(a,w) αa 1 r δr 2 δw 2 so g(a,w) = r αa 0 0 1 g (r,w) r αa 0 0 1 = r 2 α 2 a 2 Γ -1 r qr-2 0 0 r qw g w = Γ -1 α -2 a qr/α-2 0 0 a qw/α g w . So g is precisely g [ qr α , qw α ,α 2 Γ] on R * + × W, which proves the claim.

Formula formula_67: Proposition C.4. Let T : (Ω 1 , g [1] ) → (Ω 2 , g [2]

Formula formula_68: ∂ t µ t = -div(µ t ∇F ′ [µ t ]) (where ∇ denotes Riemannian gradient in (Ω 1 , g [1] )). Then, (μ) t := (T ♯ µ t ) t is a

Formula formula_69: P(Ω 2 ) → R defined by F (μ) = F (T -1 ♯ μ).

Formula formula_70: δy ⊤ g [2]y δy ′ = δx ⊤ g [1]x δx ′ = δy ⊤ ((DT (x)) -1 ) ⊤ g [1]x (DT (x)) -1 δy ′ so g -1 [1]x = (DT (x)) -1 ) g -1 [2]T (x) ((DT (x)) -1 ) ⊤ . Also note that F ′ [μ](y) = F ′ [T -1 ♯ μ](T -1 (y))

Formula formula_71: 1 ε F (μ + εν) -F (μ) = lim ε→0 1 ε F (T -1 ♯ μ + εT -1 ♯ ν) -F (T -1 ♯ μ) . In particular D F ′ [μ](y) = DF ′ [T -1 ♯ μ](T -1 (y))(DT (T -1 (y))) -1 . Then for any φ : Ω 2 → R, d dt Ω2 φdμ t = d dt Ω1 φ(T (x))dµ t (x) = Ω1 Dφ(T (x))DT (x) g -1 [1] DF ′ [µ t ](x)dµ t (x) = Ω1 Dφ(y) g -1 [2] D F ′ [μ t ](y)dμ t (y).

Formula formula_72: ∂ t μt = -div(μ t g -1 [2] D F ′ [μ t ]), i.e., (

Formula formula_73: Ω * = R * × W. Fix q r , q w ∈ R, Γ, p, λ > 0 and b ∈ [1, 2]. Let (µ t ) t the Wasserstein gradient flow for F λ,b,p over (Ω * , g [qr,qw,Γ]

Formula formula_74: p = p α , qr = q r α , qw = q w α , Γ = α 2 Γ, λ = λ, b = b.

Formula formula_75: T = T α , Ω 1 = (Ω * , g [qr,qw,Γ] ), Ω 2 = (Ω * , g [q ′

Formula formula_76: F λ,b,p ((T α ) -1 ♯ μ) = F λ,b,p ((T α -1 ) ♯ μ) = G (h p (T α -1 ) ♯ μ) + λ 2 Ψ b,p ((T α -1 ) ♯ μ) ,

Formula formula_77: W → R, W φd [h p (T α -1 ) ♯ μ] = R W φ(w) sign(r) |r| p [(T α -1 ) ♯ μ] (dr, dw) = R W φ(w) sign(r) |r| p/α μ(dr, dw) = W φd h p/α μ ,and

Formula formula_78: Ψ b,p ((T α -1 ) ♯ μ) = |r| pb d [(T α -1 ) ♯ μ] 2/b = |r| pb/α dμ(r, w) 2/b . This confirms that F • T -1 ♯ = F λ,

Formula formula_79: F ′ λ,b,p [µ](r, w) = sign(r) |r| p G ′ [h p µ](w) + λ ′ |r| pb (C.2)

Formula formula_80: λ ′ = λ 1 b Ψ b,p (µ) 1-b 2 .

Formula formula_81: lim ε→0 1 ε [(G • h p )(µ + εµ ′ ) -(G • h p )(µ)] = lim ε→0 1 ε [G(h p µ + εh p µ ′ ) -G(h p µ)] = W G ′ [h p µ](w)d [h p µ ′ ] (w) = R×W sign(r) |r| p G ′ [h p µ](w)dµ ′ (r, w)

Formula formula_82: (G • h p ) ′ [µ](r, w) = sign(r) |r| p G ′ [h p µ](w). Moreover Ψ b,p (µ) = Ω |r| pb dµ(r, w) 2 b Ψ ′ b,p [µ](r, w) = 2 b Ω |r| ′pb dµ(r ′ , w ′ ) 2 b -1 |r| pb = 2 b Ψ b,p (µ) 1-b 2 |r| pb .

Formula formula_83: F λ,b,p = G•h p + λ 2 Ψ b,p .

Formula formula_84: f := F ′ λ,b [µ] R * + ×W the restriction of F ′ λ,b [µ] to R * + × W must have Lipschitz-continuous Riemannian gradients. More explicitly, by (C.2), f (r, w) = rG ′ [ν](w) + λ ′ µ r b where λ ′ µ = λ b Ψ b (µ) 1-b 2 . So by Lem. C.7, necessarily b = 1, and so λ ′ µ = λΨ 1 (µ) 1/2 . If φ := G ′ [ν] satisfies ∇ 2 φ(w) = Γp 2 φ(w) + λ ′ µ g

Formula formula_85: F ′ λ,b [µ ′ ] R * + ×W instead of f , since λ ′ µ ′ ̸ = λ ′ µ ,

Formula formula_86: λ 0µ = sup w∈W |G ′ [hµ](w)| Ψ 1 (µ) 1/2 .

Formula formula_87: Ψ 1 (µ) 1/2 λ < |G ′ [hµ](w 0 )| . Let us distinguish cases between G ′ [hµ](w 0 ) ≥ 0 or G ′ [hµ](w 0 ) < 0. First suppose G ′ [hµ](w 0 ) ≥ 0, so that Ψ 1 (µ) 1/2 λ < G ′ [hµ](w 0 ). By continuity of G ′ [hµ], let N ⊂ W an open neighborhood of w 0 such that ∀w ∈ N, Ψ 1 (µ) 1/2 λ < G ′ [hµ](w). Then, since F ′ λ,1 [µ](r, w) = |r| sign(r)G ′ [hµ](w) + λΨ 1 (µ) 1/2 by (C.2), ∀r ∈ R -, ∀w ∈ N, F ′ λ,1 [µ](r, w) = |r| -G ′ [hµ](w) + λΨ 1 (µ) 1/2 ≤ 0 and so R W e -βF ′ λ,1 [µ](r,w) drdw ≥ R-N e -βF ′ λ,1 [µ](r,w) drdw ≥ R-N 1 drdw = +∞.

Formula formula_88: Likewise, now suppose that G ′ [hµ](w 0 ) < 0, so that Ψ 1 (µ) 1/2 λ < -G ′ [hµ](w 0 ). By continuity of G ′ [hµ], let N ⊂ W an open neighborhood of w 0 such that ∀w ∈ N, Ψ 1 (µ) 1/2 λ < -G ′ [hµ](w). Then ∀r ∈ R + , ∀w ∈ N, F ′ λ,1 [µ](r, w) = |r| G ′ [hµ](w) + λΨ 1 (µ) 1/2 ≤ 0 and so R W e -βF ′ λ,1 [µ](r,w) drdw ≥ R+ N e -βF ′ λ,1 [µ](r,w) drdw ≥ R+ N 1 drdw = +∞.

Formula formula_89: g (r,w) = α(r) -1 0 0 β(r) -1 g w

Formula formula_90: ∇ 2 f 00 = α(r) 2 D 2 rr f + 1 2 α(r)α ′ (r)D r f ∇ 2 f i0 = ∇ 2 f 0i = α(r)β(r)∇D r f r (w) i + 1 2 α(r)β ′ (r)∇f r (w) i ∇ 2 f ij = β(r) 2 ∇ 2 f r (w) ij - 1 2 α(r)β ′ (r) • D r f • (g -1 w ) ij .

Formula formula_91: ∇ 2 f (r, w) IJ = g IK g JL ∂ 2 f ∂ω K ∂ω L -Γ M KL ∂f ∂ M ω and Γ M IJ = 1 2 g M K ∂g KI ∂ω J + ∂g KJ ∂ω I - ∂g IJ ∂ω K

Formula formula_92: Γ 0 00 = - 1 2 α ′ (r) α(r) Γ 0 i0 = Γ 0 0i = 0 Γ 0 ij = 1 2 α(r) β ′ (r) β(r) 2 g ij Γ m 00 = 0 Γ m i0 = Γ m 0i = - 1 2 β ′ (r) β(r) δ m i Γ m ij = Γ m ij .

Formula formula_93: ∇ 2 f 00 = α(r) 2 D 2 rr f + 1 2 α(r)α ′ (r)D r f ∇ 2 f i0 = ∇ 2 f 0i = α(r)β(r)∇D r f r (w) i + 1 2 α(r)β ′ (r)∇f r (w) i ∇ 2 f ij = β(r) 2 ∇ 2 f r (w) ij - 1 2 α(r)β ′ (r) • D r f • g ij ,

Formula formula_94: Ω * + = R * + × W → R defined by f (r, w) = r p φ(w) + λ ′ r pb , for some p > 0, b ∈ [1, 2], λ ′ ≥ 0 and φ : W → R.

Formula formula_95: ∇ 2 f 00 = Γ 2 p(p -q r /2)r 2-2qr+p φ(w) + Γ 2 pbλ ′ (pb -q r /2)r 2-2qr+pb ∇ 2 f i0 = ∇ 2 f 0i = Γ(p -q w /2)r 1-qr-qw+p ∇ φ(w) i ∇ 2 f ij = r p-2qw ∇ 2 φ(w) ij + 1 2 Γq w r -qr-qw • pr p φ(w) + pbλ ′ r pb (g -1 w ) ij .

Formula formula_96: D r f = pr p-1 φ(w) + pbλ ′ r pb-1 D 2 rr f = p(p -1)r p-2 φ(w) + pb(pb -1)λ ′ r pb-2 ∇f r (w) i = r p ∇ φ(w) i ∇ 2 f r (w) ij = r p ∇ 2 φ(w) ij ∇D r f r (w) i = pr p-1 ∇ φ(w) i

Formula formula_97: ∇ 2 f 00 = α(r) 2 p(p -1)r p-2 φ(w) + pb(pb -1)λ ′ r pb-2 + 1 2 α(r)α ′ (r) pr p-1 φ(w) + pbλ ′ r pb-1 = α(r)p α(r)(p -1) + 1 2 rα ′ (r) r p-2 φ(w) + α(r)pbλ ′ α(r)(pb -1) + 1 2 rα ′ (r) r pb-2 ∇ 2 f i0 = ∇ 2 f 0i = α(r)β(r) • pr p-1 ∇ φ(w) i + 1 2 α(r)β ′ (r) • r p ∇ φ(w) i = α(r) β(r)p + 1 2 rβ ′ (r) r p-1 ∇ φ(w) i ∇ 2 f ij = β(r) 2 • r p ∇ 2 φ(w) ij - 1 2 α(r)β ′ (r) • pr p-1 φ(w) + pbλ ′ r pb-1 • g ij .

Formula formula_98: sup ω∈Ω * + sup s∈TωΩ * + ∥s∥ ω =1 ∇ 2 f (ω) IJ g JK s K ω < ∞.

Formula formula_99: H(ω) = √ g IK ∇ 2 f (ω) IJ √ g JL KL

Formula formula_100: g 00 = α(r) -1/2 = Γ -1/2 r qr/2-1 , g i0 = 0, g ij = β(r) -1/2 √ g ij = r qw/2 √ g ijand

Formula formula_101: H(ω) 00 = g 00 ∇ 2 f 00 = Γp(p -q r /2)r -qr+p φ(w) + Γpbλ ′ (pb -q r /2)r -qr+pb H(ω) j0 = g 00 g ji ∇ 2 f i0 = Γ 1/2 (p -q w /2)r -qr/2-qw/2+p • √ g ji ∇ φ(w) i H(ω) kl = g ki g lj ∇ 2 f ij = r p-qw • √ g ki √ g lj ∇ 2 φ(w) ij + Γ 1 2 q w r -qr • pr p φ(w) + pbλ ′ r pb δ kl .

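Formula formula_102: $\|\nu\|_{TV}^2 = \Big(\int_{\mathcal{W}} |\nu(dw)|\Big)^2 = \inf_{\eta \in \mathcal{P}(\mathcal{W})} \int_{\mathcal{W}} \frac{|\nu(dw)|^2}{\eta(dw)} = \inf_{\eta \in \mathcal{P}(\mathcal{W}),\ f : \mathcal{W} \to \mathbb{R} \text{ s.t. } f\eta = \nu} \int_{\mathcal{W}} |f|^2\, d\eta.$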

Formula formula_103: η(dw) = |ν(dw)| ∥ν∥ T V

Formula formula_104: ) = |ν(dw)| ∥ν∥ T V

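Formula formula_105: $\inf_{\eta \in \mathcal{P}(\mathcal{W})} J_\lambda(\eta) = \inf_{\eta \in \mathcal{P}(\mathcal{W}),\ f : \mathcal{W} \to \mathbb{R}} G(f\eta) + \frac{\lambda}{2} \int_{\mathcal{W}} |f|^2\, d\eta = \inf_{\nu \in \mathcal{M}(\mathcal{W})}\ \inf_{\eta \in \mathcal{P}(\mathcal{W}),\ f : \mathcal{W} \to \mathbb{R} \text{ s.t. } f\eta = \nu} G(f\eta) + \frac{\lambda}{2} \int_{\mathcal{W}} |f|^2\, d\eta = \inf_{\nu \in \mathcal{M}(\mathcal{W})} G(\nu) + \frac{\lambda}{2} \Big(\inf_{\eta \in \mathcal{P}(\mathcal{W}),\ f : \mathcal{W} \to \mathbb{R} \text{ s.t. } f\eta = \nu} \int_{\mathcal{W}} |f|^2\, d\eta\Big) = \inf_{\nu \in \mathcal{M}(\mathcal{W})} G(\nu) + \frac{\lambda}{2} \|\nu\|_{TV}^2 = \inf_{\nu \in \mathcal{M}(\mathcal{W})} G_\lambda(\nu).$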

Formula formula_106: inf η∈P(W), f :W→R s.t. f η=ν λ 2 W |f |

Formula formula_107: + λ 2 |ν| 2

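Formula formula_108: $\forall i \le N, \quad dr^i_t = -\Gamma\, \nabla_{r^i} F'_{\lambda,2}\Big[\frac{1}{N} \sum_{j=1}^N \delta_{(r^j_t, w^j_t)}\Big](r^i_t, w^i_t)\,dt, \qquad dw^i_t = -\nabla_{w^i} F'_{\lambda,2}\Big[\frac{1}{N} \sum_{j=1}^N \delta_{(r^j_t, w^j_t)}\Big](r^i_t, w^i_t)\,dt + \sqrt{2\beta^{-1}}\, dB^i_t.$

Formula formula_109: $F'_{\lambda,2}[\mu](r, w) = r\, G'[h\mu](w) + \frac{\lambda}{2} |r|^2$ so $\nabla_r F'_{\lambda,2}[\mu](r, w) = G'[h\mu](w) + \lambda r$ and $\nabla_w F'_{\lambda,2}[\mu](r, w) = r\, \nabla G'[h\mu](w).$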
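The particle system above can be discretized with an Euler-Maruyama scheme. The following is a minimal sketch under assumptions that are not in the paper: $\mathcal{W}$ is the one-dimensional torus $\mathbb{R}/\mathbb{Z}$, $G$ is a least-squares deconvolution loss with the illustrative periodic kernel `kernel`, and the names `lifted_step`, `lifted_objective` and `KAPPA` are hypothetical.

```python
import numpy as np

KAPPA = 1.0  # width parameter of the illustrative kernel (assumption)

def kernel(u):
    """Periodic bump psi(u) = exp(kappa * (cos(2 pi u) - 1)) on the torus R/Z."""
    return np.exp(KAPPA * (np.cos(2 * np.pi * u) - 1.0))

def kernel_dw(u):
    """Derivative of w -> psi(x - w), written as a function of u = x - w."""
    return 2 * np.pi * KAPPA * np.sin(2 * np.pi * u) * kernel(u)

def lifted_objective(r, w, x, y, lam):
    """F_{lambda,2}(mu) = G(h mu) + (lambda/2) * mean(r^2) for the empirical mu."""
    resid = kernel(x[:, None] - w[None, :]) @ r / r.size - y
    return 0.5 * np.mean(resid ** 2) + 0.5 * lam * np.mean(r ** 2)

def lifted_step(r, w, x, y, lam, beta, gamma, dt, rng):
    """One Euler-Maruyama step of the (r, w) dynamics; noise acts on w only."""
    u = x[:, None] - w[None, :]
    resid = kernel(u) @ r / r.size - y          # prediction of nu_t minus target
    g = kernel(u).T @ resid / x.size            # G'[nu_t](w_i)
    dg = kernel_dw(u).T @ resid / x.size        # grad_w G'[nu_t](w_i)
    noise = np.sqrt(2.0 * dt / beta) * rng.standard_normal(w.size)
    r_new = r - dt * gamma * (g + lam * r)
    w_new = (w - dt * r * dg + noise) % 1.0     # wrap back onto the torus
    return r_new, w_new

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 64, endpoint=False)
y = kernel(x - 0.3) - 0.5 * kernel(x - 0.7)     # signal from two signed spikes
r, w = np.zeros(20), rng.uniform(size=20)

f_start = lifted_objective(r, w, x, y, lam=0.1)
for _ in range(200):
    # beta = inf switches the noise off, so this demo is plain gradient descent.
    r, w = lifted_step(r, w, x, y, lam=0.1, beta=np.inf, gamma=1.0, dt=1e-3, rng=rng)
f_end = lifted_objective(r, w, x, y, lam=0.1)
```

Passing a finite `beta` restores the Langevin noise $\sqrt{2\beta^{-1}}\,dB^i_t$ on the positions; the noiseless run is used here only so that the decrease of the objective is deterministic.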

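Formula formula_110: $J_\lambda(\eta) = G(f_\eta \eta) + \frac{\lambda}{2} \int |f_\eta|^2\, d\eta$ where $f_\eta$ is the unique solution of the fixed-point equation $\forall w \in \mathcal{W},\ f_\eta(w) = -\frac{1}{\lambda} G'[f_\eta \eta](w).$ (D.1)

Formula formula_111: $J'_\lambda[\eta](w) = -\frac{\lambda}{2} |f_\eta|^2(w).$ (D.2)

Formula formula_112: $\min_{f \in L^2_\eta(\mathcal{W})} G(f\eta) + \frac{\lambda}{2} \int_{\mathcal{W}} |f|^2\, d\eta.$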

Formula formula_113: G ′ [ fη η] η + λ fη η = 0 in M(W). Now let f η = -1 λ G ′ [ fη η],

Formula formula_114: g η = -1 λ G ′ [g η η] = -1 λ G ′ [ fη η] = f η .

Formula formula_115: + λ 2 |f | 2 dη is w → f (w)G ′ [f η](w) + λ 2 |f (w)| 2 , J ′ λ [η](w) = f η (w)G ′ [f η η](w) + λ 2 |f η (w)| 2 = - λ 2 |f η (w)| 2 = - 1 2λ |G ′ [f η η]| 2 (w),

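Formula formula_116: $\sup_{w \in \mathcal{W}} |G'[\nu](w)|^2 \le 2 L_0\, G(\nu).$ Proof. We follow the proof technique of [GGGM21, Appendix D]. Let $w_0 \in \mathcal{W}$ and $\nu' = \nu - \frac{1}{L_0} G'[\nu](w_0)\, \delta_{w_0}$. By the mean-value theorem there exists $\theta \in (0, 1)$ such that $G(\nu') - G(\nu) = \int G'[\nu + \theta(\nu' - \nu)]\, d(\nu' - \nu)$, and so $\inf G \le G(\nu') \le G(\nu) + \int G'[\nu]\, d(\nu' - \nu) + \frac{L_0}{2} \|\nu' - \nu\|_{TV}^2 = G(\nu) - \frac{1}{L_0} G'[\nu](w_0)^2 + \frac{1}{2 L_0} G'[\nu](w_0)^2 = G(\nu) - \frac{1}{2 L_0} G'[\nu](w_0)^2.$

Formula formula_117: $\forall w \in \mathcal{W}, \quad \frac{1}{2 L_0} G'[\nu](w)^2 \le G(\nu) - \inf G \le G(\nu)$

Formula formula_118: $\sup_{\mathcal{W}} |f_\eta| \le \frac{1}{\lambda} \sqrt{2 L_0 J_\lambda(\eta)}$

Formula formula_119: $\sup_{w \in \mathcal{W}} \|\nabla^i f_\eta\|_w \le \frac{L_i}{\lambda^2} \sqrt{2 L_0 J_\lambda(\eta)} + \frac{B_i}{\lambda}.$

Formula formula_120: $\lambda^2 |f_\eta(w)|^2 = |G'[f_\eta \eta](w)|^2 \le 2 L_0\, G(f_\eta \eta) \le 2 L_0 \Big(G(f_\eta \eta) + \frac{\lambda}{2} \int |f_\eta|^2\, d\eta\Big) = 2 L_0 J_\lambda(\eta)$

Formula formula_121: $\sup_w \|\nabla^i G'[\nu]\|_w \le L_i \|\nu\|_{TV} + B_i$, so $\lambda \|\nabla^i f_\eta\|_w = \|\nabla^i G'[f_\eta \eta]\|_w \le B_i + L_i \|f_\eta \eta\|_{TV} = B_i + L_i \int |f_\eta|\, d\eta \le B_i + L_i \sup_{\mathcal{W}} |f_\eta| \le B_i + L_i \frac{1}{\lambda} \sqrt{2 L_0 J_\lambda(\eta)}$

Formula formula_122: $J_\lambda$: $J_\lambda(\eta) = \inf_{f \in L^2_\eta} G(f\eta) + \frac{\lambda}{2} \int |f|^2\, d\eta \le G(0).$

Formula formula_123: $\forall \eta, \eta' \in \mathcal{P}(\mathcal{W}), \quad |J_\lambda(\eta) - J_\lambda(\eta')| \le B\, W_2(\eta, \eta')$

Formula formula_124: $B = \sqrt{2 L_0 G(0)} \cdot \Big(\frac{L_1}{\lambda^2} \sqrt{2 L_0 G(0)} + \frac{B_1}{\lambda}\Big).$

Formula formula_125: $J'_\lambda[\eta](w) = -\frac{\lambda}{2} |f_\eta|^2(w)$ so $\nabla J'_\lambda[\eta](w) = -\lambda f_\eta(w) \nabla f_\eta(w)$, $\ \|\nabla J'_\lambda[\eta](w)\|_w \le \lambda \sup_{\mathcal{W}} |f_\eta| \cdot \sup_{\mathcal{W}} \|\nabla f_\eta\| \le \lambda \cdot \frac{1}{\lambda} \sqrt{2 L_0 G(0)} \cdot \Big(\frac{L_1}{\lambda^2} \sqrt{2 L_0 G(0)} + \frac{B_1}{\lambda}\Big) =: B < \infty$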

Formula formula_126: |J λ (η) -J λ (η ′ )| ≤ BW 2 (η, η ′

Formula formula_127: ∀w ∈ W, λh(w) + G ′′ [f η η](w, w ′′ )dη(w ′′ )h(w ′′ ) = -G ′′ [f η η](w, w ′ )f η (w ′ ).

Formula formula_128: sup w∈W |h(w)| ≤ 1 + L0 λ L0 λ 2L 0 G(0).

Formula formula_129: ∀w ∈ W, λh(w) + G ′′ [f η η](w, w ′′ )dη(w ′′ )h(w ′′ ) = -⟨s ′ , ∇ w ′ [G ′′ [f η η](w, w ′ )f η (w ′ )]⟩ w ′ .

Formula formula_130: sup w∈W |h(w)| ≤ 1 + L0 λ • 1 + L0 λ L1 λ 2L 0 G(0) + L0B1 λ . Proof. Let G : L 2 η (W) → L 2 η (W) the operator (G h)(w) = G ′′ [f η η](w, w ′′ )dη(w ′′ ) h(w ′′ ).

Formula formula_131: |G ′′ [f η η](w, w ′ )| ≤ L 0 . Note that G ′′ [f η η](w, w ′′

Formula formula_132: V 1 (•) = -G ′′ [f η η](•, w ′ )f η (w ′ ). By Lem. D.4 we have ∥V 1 ∥ L 2 η ≤ sup W |V 1 | ≤ sup W×W |G ′′ [f η η]| • sup W |f η | ≤ L 0 • 1 λ 2L 0 G(0) =: V 1 . Also let V 2 (•) = -⟨s ′ , ∇ w ′ [G ′′ [f η η](•, w ′ )f η (w ′ )]⟩ w ′ .

Formula formula_133: ∥V 2 ∥ L 2 η ≤ sup W |V 2 | ≤ sup w,w ′ ∥∇ w ′ G ′′ [f η η](w, w ′ )∥ • sup W |f η | + sup W×W |G ′′ [f η η]| • sup W ∥∇f η ∥ ≤ L 1 • 1 λ 2L 0 G(0) + L 0 • L 1 λ 2 2L 0 G(0) + B 1 λ = 1 + L 0 λ L 1 λ 2L 0 G(0) + L 0 B 1 λ =: V 2 .

Formula formula_134: |h| 2 dη = h L 2 η = (λ id +G) -1 V j L 2 η ≤ λ -1 ∥V j ∥ L 2 η ≤ λ -1 V j

Formula formula_135: λh(w) = V j (w) -dη(w ′′ )G ′′ [f η η](w, w ′′ )h(w ′′ ) λ |h(w)| ≤ |V j (w)| + dη(w ′′ ) |G ′′ [f η η](w, w ′′ )| |h(w ′′ )| ≤ V j + ∥G ′′ [f η η](w, •)∥ L 2 η ∥h∥ L 2 η ≤ V j + L 0 • λ -1 V j .

Formula formula_136: L 0 , L 1 , B 1 , L 2 such that sup W |f η -f η ′ | ≤ HW 2 (η, η ′ ) and sup w∈W ∥∇f η -∇f η ′ ∥ w ≤ H ′ W 2 (η, η ′ ).

Formula formula_137: λf η (w) + G ′ [f η η](w) = 0 so λ δf η (w) δη(w ′ ) + G ′′ [f η η](w, w ′ )f η (w ′ ) + (G ′′ [f η η](w, •)) d η δf η (•) δη(w ′ ) = 0 λ δf η (w) δη(w ′ ) + G ′′ [f η η](w, w ′′ )η(dw ′′ ) δf η (w ′′ ) δη(w ′ ) = -G ′′ [f η η](w, w ′ )f η (w ′ ). (D.

Formula formula_138: sup w∈W sup η∈P(W) sup w ′ ∈W ∇ w ′ δf η (w) δη(dw ′ ) w ′ ≤ H

Formula formula_139: λ s ′ , ∇ w ′ δf η (w) δη(w ′ ) w ′ + G ′′ [f η η](w, w ′′ )η(dw ′′ ) s ′ , ∇ w ′ δf η (w ′′ ) δη(w ′ ) w ′ = -⟨s ′ , ∇ w ′ [G ′′ [f η η](w, w ′ )f η (w ′ )]⟩ w ′

Formula formula_140: sup w∈W sup s∈TwW ∥s∥ w =1 sup η∈P(W) sup w ′ ∈W ∇ w ′ δ ⟨s, ∇f η (w)⟩ w δη(dw ′ ) w ′ ≤ H ′

Formula formula_141: λ s ′ , ∇ w ′ δ ⟨s, ∇f η (w)⟩ w δη(w ′ ) w ′ + ∇ w G ′′ [f η η](w, w ′′ )η(dw ′′ ) s ′ , ∇ w ′ δf η (w ′′ ) δη(w ′ ) w ′ = -s, ∇ w ⟨s ′ , ∇ w ′ [G ′′ [f η η](w, w ′ )f η (w ′ )]⟩ w ′ w

Formula formula_142: λ ∇ w ′ δ ⟨s, ∇f η (w)⟩ w δη(dw ′ ) w ′ ≤ ∥∇ w ∇ w ′ G ′′ [f η η]∥ • |f η (w ′ )| + ∥∇ w G ′′ [f η η]∥ w • ∥∇f η (w ′ )∥ w ′ + sup w ′′ ∈W ∥∇ w G ′′ [f η η](w, w ′′ )∥ w • sup w ′′ ∈W ∇ w ′ δf η (w ′′ ) δη(dw ′ ) w ′ ≤ L 2 • 1 λ 2L 0 G(0) + L 1 • L 1 λ 2 2L 0 G(0) + B 1 λ + L 1 • H =: H ′ by Assumption 1.

Formula formula_143: |f η (w) -f η ′ (w)| ≤ sup η ′′ ∈P(W) sup w ′ ∈W ∇ w ′ δf η ′′ (w) δη ′′ (dw ′ ) w ′ W 2 (η, η ′ ) ≤ HW 2 (η, η ′ ).

Formula formula_144: ∇f η ′ (w)-∇fη(w) ∥∇f η ′ (w)-∇fη(w)∥ w ∈ T w W. Then by Lem. D.8 below applied to F (η) = ⟨s, ∇f η (w)⟩ w , ∥∇f η ′ (w) -∇f η (w)∥ = ⟨s, ∇f η ′ (w)⟩ w -⟨s, ∇f η (w)⟩ w ≤ H ′ W 2 (η, η ′ ).

Formula formula_145: ∀η ∈ P(W), ∀w ∈ W, ∥∇F ′ [η](w)∥ w ≤ B. Then ∀η, η ′ ∈ P(W), |F (η) -F (η ′ )| ≤ BW 1 (η, η ′ ) ≤ BW 2 (η, η ′ ).

Formula formula_146: d dθ F (η θ ) = W F ′ [η θ ]d (∂ θ η θ )

Formula formula_147: ∀φ : W → R, d dθ W φdη θ = d dθ W×W φ(Σ θ (x, y))dγ(x, y) = W×W d dθ φ(Σ θ (x, y))dγ(x, y) = W×W ⟨Σ ′ θ (x, y), ∇φ(Σ θ (x, y))⟩ Σ θ (x,y) dγ(x, y).

Formula formula_148: d dθ F (η θ ) = W×W ⟨Σ ′ θ (x, y), ∇F ′ [η θ ](Σ θ (x, y))⟩ Σ θ (x,y) dγ(x, y) d dθ F (η θ ) ≤ W×W ∥Σ ′ θ (x, y)∥ Σ θ (x,y) • ∥∇F ′ [η θ ](Σ θ (x, y))∥ Σ θ (x,y) dγ(x, y) ≤ sup w∈W sup η ′ ∈P(W) ∥∇F ′ [η](w)∥ w • W×W ∥Σ ′ θ (x, y)∥ Σ θ (x,y) dγ(x, y) ≤ B • W×W dist(x, y)dγ(x, y) = BW 1 (η, η ′ )

Formula formula_149: |F (η) -F (η ′ )| = 1 0 d dθ F (η θ ) dθ ≤ sup θ∈[0,1] d dθ F (η θ ) ≤ BW 1 (η, η ′ ).

Formula formula_150: ′ λ [η](w) = -λ 2 |f η | 2 (w) with f η = -1 λ G ′ [f η η] over W.

Formula formula_151: ∀η ∈ P 2 (W), ∀w ∈ W, max s∈TwW ∥s∥ w ≤1 ∇ 2 J ′ λ [η](s, s) ≤ Λ

Formula formula_152: $\nabla J'_\lambda[\eta](w) = -\lambda f_\eta(w) \nabla f_\eta(w), \qquad \nabla^2 J'_\lambda[\eta](w) = -\lambda f_\eta(w) \nabla^2 f_\eta(w) - \lambda \nabla f_\eta(w) \nabla^\top f_\eta(w)$
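The gradient formula $\nabla J'_\lambda[\eta] = -\lambda f_\eta \nabla f_\eta$, combined with $\nabla f_\eta = -\frac{1}{\lambda} \nabla G'[f_\eta \eta]$ from (D.1), suggests the following particle discretization of MFLD on the bilevel objective. This is a minimal sketch under assumptions not taken from the paper: $\mathcal{W}$ is the torus $\mathbb{R}/\mathbb{Z}$, $G$ is a least-squares loss with an illustrative periodic kernel, the inner problem is solved exactly by a linear solve, and `inner_solution` and `bilevel_step` are hypothetical names.

```python
import numpy as np

def kernel(u):
    """Illustrative periodic feature psi(u) = exp(cos(2 pi u) - 1) on R/Z."""
    return np.exp(np.cos(2 * np.pi * u) - 1.0)

def kernel_dw(u):
    """Derivative of w -> psi(x - w), written as a function of u = x - w."""
    return 2 * np.pi * np.sin(2 * np.pi * u) * kernel(u)

def inner_solution(w, x, y, lam):
    """Return (f_eta at the particles, J_lambda(eta)) for eta = (1/N) sum_i delta_{w_i}."""
    n, N = x.size, w.size
    Phi = kernel(x[:, None] - w[None, :])
    # Normal equations of the inner ridge problem in f.
    f = np.linalg.solve(Phi.T @ Phi / (n * N) + lam * np.eye(N), Phi.T @ y / n)
    resid = Phi @ f / N - y
    return f, 0.5 * np.mean(resid ** 2) + 0.5 * lam * np.mean(f ** 2)

def bilevel_step(w, x, y, lam, beta, dt, rng):
    """One Euler-Maruyama step of dw = -grad J'_lambda[eta](w) dt + sqrt(2/beta) dB."""
    f, _ = inner_solution(w, x, y, lam)
    u = x[:, None] - w[None, :]
    resid = kernel(u) @ f / w.size - y
    grad_f = -(kernel_dw(u).T @ resid / x.size) / lam   # grad f_eta = -(1/lam) grad G'[f_eta eta]
    drift = lam * f * grad_f                            # equals -grad J'_lambda[eta](w_i)
    noise = np.sqrt(2.0 * dt / beta) * rng.standard_normal(w.size)
    return (w + dt * drift + noise) % 1.0

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 64, endpoint=False)
y = kernel(x - 0.3) - 0.5 * kernel(x - 0.7)
w = rng.uniform(size=20)

_, j_start = inner_solution(w, x, y, lam=1.0)
for _ in range(100):
    # beta = inf removes the noise; the demo then reduces to gradient descent on J_lambda.
    w = bilevel_step(w, x, y, lam=1.0, beta=np.inf, dt=1e-3, rng=rng)
_, j_end = inner_solution(w, x, y, lam=1.0)
```

Each step requires solving the inner problem for $f_\eta$, which illustrates the higher per-iteration cost of the bilevel approach compared with the lifted dynamics.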

Formula formula_153: ∇ 2 J ′ λ [η](s, s) ≤ λ |f η | ∇ 2 f η + λ ∥∇f η ∥ 2 ≤ 2L 0 G(0) L 2 λ 2 2L 0 G(0) + B 2 λ + λ L 1 λ 2 2L 0 G(0) + B 1 λ 2 by Lem. D.4.

Formula formula_154: ∀w ∈ W, ∀η, η ′ ∈ P 2 (W), ∥∇J ′ λ [η] -∇J ′ λ [η ′ ]∥ w ≤ Λ W 2 (η, η ′ ) for some Λ < ∞. Indeed, ∥∇J ′ λ [η] -∇J ′ λ [η ′ ]∥ w = λ ∥f η ∇f η -f η ′ ∇f η ′ ∥ ≤ λ (∥f η (∇f η -∇f η ′ )∥ + ∥(f η -f η ′ )∇f η ′ ∥) ≤ λ sup η ′′ sup W |f η ′′ | • sup W ∥∇f η -∇f η ′ ∥ + sup η ′′ sup W ∥∇f η ′′ ∥ • sup W |f η -f η ′ | ≤ λ 1 λ 2L 0 G(0) • H ′ W 2 (η, η ′ ) + L 1 λ 2 2L 0 G(0) + B 1 λ • HW 2 (η, η ′ ) =: ΛW 2 (η, η ′ )

Formula formula_155: $|J'_\lambda[\eta](w)| = \frac{\lambda}{2} |f_\eta|^2(w) \le \frac{L_0}{\lambda} J_\lambda(\eta).$

Formula formula_156: $J_\lambda(\eta_t) \le J_\lambda(\eta_t) + \beta^{-1} H(\eta_t | \tau) \le J_\lambda(\eta_0) + \beta^{-1} H(\eta_0 | \tau)$

Formula formula_157: ∀η s.t. W 1 (η, η * ) ≤ A, J (η) -J (η * ) ≤ BW ∞ (η, η * ). Denote J β = J + β -1 H (•|τ ), for any β > 0. Then min η:W1(η,η * )≤A J β (η) ≤ J (η * ) + inf 0<ϵ≤min{1,A} Bϵ + d β log 1 ϵ + log C β where C := inf w∈W inf 0<ϵ≤1 ϵ -d • τ ({w ′ ; dist W (w, w ′ ) ≤ ϵ}) -1 .

Formula formula_158: dη ϵ dτ (w ′ ) = w∈W dγ ϵ,w dτ (w ′ )η * (dw) = w∈W 1(w ′ ∈ B ϵ (w)) τ (B ϵ (w)) η * (dw).

Formula formula_159: C such that τ (B ϵ (w)) ≥ C -1 ϵ d for all ϵ ≤ 1 [GV79, Theorem 3.3]. As a consequence, H (η ϵ |τ ) = dη ϵ (w ′ ) log dη ϵ dτ (w ′ ) ≤ sup w∈W -log τ (B ε (w)) ≤ d log(1/ϵ) + log C.

Formula formula_160: W 1 (η ϵ , η * ) ≤ W ∞ (η ϵ , η * ) ≤ ϵ ≤ A.

Formula formula_161: min η:W1(η,η * )≤A J β (η) ≤ J β (η ϵ ) = J (η ϵ ) + β -1 H (η ϵ |τ ) ≤ J (η * ) + Bϵ + β -1 (d log(1/ϵ) + log C) ,

Formula formula_162: T ∆ ≤ 2d α τ ∆J * λ log 4C 1/d B ∆J * λ • exp 4dL 0 G(0) λ∆J * λ log 4C 1/d B ∆J * λ • log 2J λ (η 0 ) ∆J * λ + H (η 0 |τ ) 2 log C where C = max 1, inf w∈W inf 0<ϵ≤1 ϵ -d • τ ({w ′ ; dist W (w, w ′ ) ≤ ϵ}) -1 .

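Formula formula_163: $J_\lambda(\eta_t) \le J_{\lambda,\beta}(\eta_t) \le \inf J_{\lambda,\beta} + e^{-2\beta^{-1}\alpha_\beta t}\, \big(J_{\lambda,\beta}(\eta_0) - \inf J_{\lambda,\beta}\big) \le \inf J_{\lambda,\beta} + e^{-2\beta^{-1}\alpha_\beta t}\, J_{\lambda,\beta}(\eta_0),$

Formula formula_164: $J_{\lambda,\beta} - J_\lambda = \beta^{-1} H(\cdot | \tau) \ge 0.$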

Formula formula_165: J = J λ , η * = arg min J λ , A = ∞ and B = 2L 0 G(0) • L1 λ 2 2L 0 G(0) + B1 λ

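Formula formula_166: $\inf J_{\lambda,\beta} \le \inf J_\lambda + \inf_{0 < \epsilon \le 1} \Big[B\epsilon + \frac{d}{\beta} \log\frac{1}{\epsilon} + \frac{\log C}{\beta}\Big].$

Formula formula_167: $\inf J_{\lambda,\beta} \le J^*_\lambda + \frac{d + \log C'}{\beta} - \frac{d}{\beta} \log\frac{d}{\beta B}.$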
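The passage from the infimum over $\epsilon$ to the explicit bound can be made concrete by a single choice of $\epsilon$. The following worked step is a sketch under two assumptions that are not stated in the lines above: $\beta \ge d/B$ (so that the chosen $\epsilon$ is at most $1$) and $\log C \le \log C'$.

```latex
% Minimize  \epsilon \mapsto B\epsilon - (d/\beta)\log\epsilon :
% the derivative  B - d/(\beta\epsilon)  vanishes at  \epsilon^\star = d/(\beta B).
\begin{align*}
B\epsilon^\star + \frac{d}{\beta}\log\frac{1}{\epsilon^\star} + \frac{\log C}{\beta}
  &= \frac{d}{\beta} + \frac{d}{\beta}\log\frac{\beta B}{d} + \frac{\log C}{\beta} \\
  &= \frac{d + \log C}{\beta} - \frac{d}{\beta}\log\frac{d}{\beta B}
  \;\le\; \frac{d + \log C'}{\beta} - \frac{d}{\beta}\log\frac{d}{\beta B}.
\end{align*}
```

Since $\inf J_\lambda = J^*_\lambda$, inserting this value into the infimum gives the stated bound on $\inf J_{\lambda,\beta}$.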

Formula formula_168: J * λ + d + log C ′ β - d β log d βB + e -2β -1 α β t J λ (η 0 ) + β -1 H (η 0 |τ ) ≤ (1 + ∆)J * λ i.e. t ≥ β 2α β log   J λ (η 0 ) + β -1 H (η 0 |τ ) ∆J * λ -d+log C ′ β -d β log d βB   =: T s ,

Formula formula_169: T s = β 2α τ • exp 1 λ L 0 βG(0) • log   J λ (η 0 ) + β -1 H (η 0 |τ ) ∆J * λ -d+log C ′ β -d β log d βB   = sd/B 2α τ • exp s 1 λB L 0 dG(0) • log J λ (η 0 ) + B sd H (η 0 |τ ) ∆J * λ -B s (1 + d -1 log C ′ + log s)

Formula formula_170: log s∆J * λ 4B = log s -log 4B ∆J * λ ≤ s∆J * λ 4B -1 so B s 1 + d -1 log C ′ + log s ≤ B s d -1 log C ′ + log 4B ∆J * λ + s∆J * λ 4B = B s d -1 log C ′ + log 4B ∆J * λ + ∆J * λ 4 , choose henceforth s = max 1, 4B ∆J * λ d -1 log C ′ + log 4B ∆J * λ , so that ∆J * λ - B s 1 + d -1 log C ′ + log s ≥ ∆J * λ 2 .

Formula formula_171: ≤ 4B ∆J * λ d -1 log C ′ + log 4B ∆J * λ .

Formula formula_172: C ′ ≥ 1, 1 ≤ 4B ∆J * λ d -1 log C ′ + log 4B ∆J * λ ⇐⇒ ∆J * λ 4B + log ∆J * λ 4B ≤ d -1 log C ′ ⇐= ∆J * λ 4B ≤ 1 and log ∆J * λ 4B ≤ -1 ⇐⇒ ∆J * λ 4B ≤ min{1, e -1 } = e -1 ⇐⇒ ∆ ≤ 4Be -1 J * λ = 4e -1 J * λ • 2L 0 G(0) L 1 λ 2 2L 0 G(0) + B 1 λ ⇐= ∆ ≤ 4e -1 J * λ • 2L 0 L 1 G(0) λ 2 ⇐= ∆ ≤ 1 J * λ • 2L 0 L 1 G(0) λ 2 . Then s = 4B ∆J * λ d -1 log C ′ + log 4B ∆J * λ , β = 4d ∆J * λ d -1 log C ′ + log 4B ∆J * λ ≥ 4 ∆J * λ log C ′ ,and

Formula formula_173: T s ≤ β 2α τ • exp 1 λ L 0 βG(0) • log J λ (η 0 ) + β -1 H (η 0 |τ ) ∆J * λ /2 ≤ 2d α τ ∆J * λ log 4C ′1/d B ∆J * λ • exp 4dL 0 G(0) λ∆J * λ log 4C ′1/d B ∆J * λ • log 2J λ (η 0 ) ∆J * λ + H (η 0 |τ ) 2 log C ′ =: T ∆ .

Formula formula_174: 3: $\beta_k = 2^k \beta_0$ 4: Run the MFLD with $\beta_k$ initialized from $\eta^k_0$ up to $T_k$, $\ \partial_t \eta^k_t = \mathrm{div}(\eta^k_t \nabla J'[\eta^k_t]) + \frac{1}{\beta_k} \Delta \eta^k_t.$

Formula formula_175: 5: $\eta^{k+1}_0 = \eta^k_{T_k}$ 6: end for 7: return $\eta^K_{T_K}$.
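The annealing loop above (doubling $\beta_k$ and warm-starting each round from the previous output) can be sketched as follows. The function `annealed_mfld` and its arguments `run_stage` and `horizon` are hypothetical names: `run_stage(eta, beta, t)` stands for any routine that runs MFLD at inverse temperature `beta` for time `t`, and `horizon(k, beta)` stands for the round length $T_k$, whose explicit value (E.1) is not reproduced here.

```python
def annealed_mfld(run_stage, eta0, beta0, num_rounds, horizon):
    """Run rounds k = 0, ..., num_rounds with beta_k = 2^k * beta0, warm-starting each round."""
    eta, schedule = eta0, []
    for k in range(num_rounds + 1):
        beta_k = (2 ** k) * beta0           # doubling schedule for the inverse temperature
        t_k = horizon(k, beta_k)            # user-supplied round length T_k
        eta = run_stage(eta, beta_k, t_k)   # the output of round k initializes round k + 1
        schedule.append((beta_k, t_k))
    return eta, schedule

# Toy usage: the "state" is a list that records the beta used in each round.
trace, schedule = annealed_mfld(
    run_stage=lambda eta, beta, t: eta + [beta],
    eta0=[],
    beta0=3.0,
    num_rounds=3,
    horizon=lambda k, beta: 0.5 * beta,
)
```

In practice `run_stage` would wrap a particle discretization of the dynamics, and `beta0` would be set to $d$ as in the analysis.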

Formula formula_176: κ 1 , C L , A > 0 such that 1. ∥J ′ [η]∥ ∞ ≤ κ 1 J (η) for all η ∈ P(W). 2. J (η) -J (η * ) ≤ C L W ∞ (η, η * ) for all η ∈ P(W) such that W 1 (η, η * ) ≤ A. Fix 0 < δ ≤ C L min{1,A} J *

Formula formula_177: T k = 2 k-1 d log 2 k J β0 (η 0 ) •α -1 τ exp 2κ 1 d δ -1 + log C L C 1/d δJ * + 2 + J β0 (η 0 ) 2 (E.1)

Formula formula_178: C := inf w∈W inf 0<ϵ≤1 ϵ -d • τ ({w ′ ; dist W (w, w ′ ) ≤ ϵ}) -1 . Then J (η K T K ) ≤ J * 1 + 3δ + 2δ log C L C 1/d δJ *

Formula formula_179: K k=0 T k ≤ d δJ * log J β0 (η 0 ) δJ * • α -1 τ exp 2κ 1 d δ -1 + log C L C 1/d δJ * + 2 + J β0 (η 0 ) 2 .

Formula formula_180: J ′ [η] : W → R is C L -Lipschitz for all η ∈ P(W), as shown in Lem. D.8, since W 1 ≤ W 2 ≤ W ∞ .

Formula formula_181: ∥J ′ [η]∥ 2 ∞ ≤ 2L (J (η) -J * ) ≤ 2LJ (η) ≤ 2L J (η) 2 J * .

Formula formula_182: Proof of Thm. E.3. Fix any 0 < δ ≤ C L min{1,A} J * . Let, for any β > 0, J β = J + β -1 H (•|τ ).

Formula formula_183: α τ exp -β k κ 1 J (η k t ) ≥ inf t ′ ≥0 α τ exp -β k κ 1 J (η k t ′ ) =: α(k).

Formula formula_184: k, t, J (η k t ) ≤ J β k (η k t ) ≤ J β k (η k 0 ), since H (•|τ ) is non-negative and (η k t ) t is a Wasserstein gradient flow of J β k , and so α(k) = inf t≥0 α τ exp -β k κ 1 J (η k t ) ≥ α τ exp -β k κ 1 J β k (η k 0 )

Formula formula_185: T k = β k 2α(k) log β k d c k for some α(k) ≤ α(k) and c k ≥ J β k (η k 0 ) -min J β k to be chosen.

Formula formula_186: J β k (η k T k ) ≤ min J β k + exp -2β -1 k α(k)T k • J β k (η k 0 ) -min J β k ≤ min J β k + β k d J β k (η k 0 ) -min J β k -1 • J β k (η k 0 ) -min J β k = min J β k + d β k .

Formula formula_187: J β k (η k T k ) ≤ J * + inf 0<ϵ≤min{1,A} C L ϵ + d β k log 1 ϵ + log C β k + d β k ≤ J * (1 + δ) + d β k log C L δJ * + d + log C β k , (E.2)

Formula formula_188: = δJ * C L ≤ min{1, A} since δ ≤ C L min{1,A} J * .

Formula formula_189: $\beta_k J(\eta^k_t) \le \beta_k J_{\beta_k}(\eta^k_t) \le \beta_k J_{\beta_k}(\eta^k_0) = \beta_k J_{\beta_k}(\eta^{k-1}_{T_{k-1}}) \le \beta_k J_{\beta_{k-1}}(\eta^{k-1}_{T_{k-1}}) = 2\beta_{k-1} J_{\beta_{k-1}}(\eta^{k-1}_{T_{k-1}}),$

Formula formula_190: J β k -J = β -1 k H (•|τ ) ≥ 0, that (η k t ) t is a Wasserstein gradient flow for J β k , that J β k-1 -J β k = (β -1 k-1 -β -1 k )H (•|τ ) ≥ 0 since (β k ) k is increasing, and that by definition β k = 2 k β 0 . So by (E.2), β k J (η k t ) ≤ 2β k-1 J β k-1 (η k-1 T k-1 ) ≤ 2β k-1 J * (1 + δ) + 2d log C L δJ * + 2d + 2 log C ≤ 2 d δ (1 + δ) + 2d log C L δJ * + 2d + 2 log C = 2d δ -1 + log C L δJ * + 2 + log C d since our choice of β 0 = d and K = ⌈log 2 (1/(δJ * ))⌉ ensures that β k-1 ≤ β K = 2 K β 0 ≤ d δJ * .

Formula formula_191: ∀t ≥ 0, β k J (η k t ) ≤ 2d δ -1 + log C L δJ * + 2 + log C d + 1 2 J β0 (η 0 )

Formula formula_192: α(k) = inf t≥0 α τ exp -κ 1 β k J (η k t ) ≥ α τ exp -2κ 1 d δ -1 + log C L δJ * + 2 + log C d + 1 2 J β0 (η 0 ) =: α(k).

Formula formula_193: J β k (η k 0 ) = J β k (η k-1 T k-1 ) ≤ J β k-1 (η k-1 T k-1 ) ≤ J β k-1 (η k-1 0 ) ≤ ... ≤ J β0 (η 0 ) by induction, so J β k (η k 0 ) -min J β k ≤ J β0 (η 0 ) =: c k . Therefore, more explicitly, T k = β k 2α(k) log β k d c k = β k 2 log β k d J β0 (η 0 ) • α -1 τ exp 2κ 1 d δ -1 + log C L δJ * + 2 + log C d + 1 2 J β0 (η 0 ) = 2 k-1 d • log 2 k J β0 (η 0 ) • α -1 τ exp 2κ 1 d δ -1 + log C L δJ * + 2 + log C d + 1 2 J β0 (η 0 ) since β k = 2 k β 0 = 2 k d. Note that K k=0 2 k-1 log 2 k J β0 (η 0 ) = K k=0 2 k log J β0 (η 0 ) 2 + K k=0 k2 k-1 log(2) = (2 K+1 -1) log J β0 (η 0 ) 2 + log(2) (K -1)2 K + 1 ≤ 2 K log J β0 (η 0 ) + log(2)K2 K ≤ 1 δJ * log J β0 (η 0 ) + 1 δJ * log 1 δJ * = 1 δJ * log J β0 (η 0 ) δJ * since K = ⌈log 2 (1/(δJ * ))⌉

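Formula formula_194: $\sum_{k=0}^K T_k$. Finally, at round $K = \lceil\log_2(1/(\delta J^*))\rceil$, then $\beta_K = 2^K\beta_0 = 2^K d \in \left[\frac{1}{2}\frac{d}{\delta J^*}, \frac{d}{\delta J^*}\right]$, so by (E.2), $J(\eta^K_{T_K}) \le J_{\beta_K}(\eta^K_{T_K}) \le J^*(1+\delta) + \frac{d}{\beta_K}\log\frac{C_L}{\delta J^*} + \frac{d + \log C}{\beta_K} \le J^*\left[1 + 3\delta + 2\delta\frac{\log(C)}{d} + 2\delta\log\frac{C_L}{\delta J^*}\right],$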

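Formula formula_195: $t_{k+1} - t_k = 2^{k-1}d\,\log\left(2^k J_{\lambda,\beta_0}(\eta_0)\right)\cdot\alpha_\tau^{-1}\exp\!\left(\frac{2L_0 d}{\lambda}\left[\delta^{-1} + \log\frac{BC^{1/d}}{\delta J^*_\lambda} + 2 + \frac{J_{\lambda,\beta_0}(\eta_0)}{2}\right]\right),$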

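Formula formula_196: $T_\Delta \le t_{K+1} \le \frac{d}{\delta J^*_\lambda}\log\frac{J_{\lambda,\beta_0}(\eta_0)}{\delta J^*_\lambda}\cdot\alpha_\tau^{-1}\exp\!\left(\frac{2L_0 d}{\lambda}\left[\delta^{-1} + \log\frac{BC^{1/d}}{\delta J^*_\lambda} + 2 + \frac{J_{\lambda,\beta_0}(\eta_0)}{2}\right]\right).$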

Formula formula_197: $\kappa_1 = \frac{L_0}{\lambda}$, i.e. $\|J'_\lambda[\eta]\|_\infty \le \frac{L_0}{\lambda} J_\lambda(\eta)$

Formula formula_198: C L = B := 2L 0 G(0) • L1 λ 2 2L 0 G(0) + B1 λ , as shown in Lem. D.5, since W 1 ≤ W 2 ≤ W ∞ .

Formula formula_199: log dη dη ′ (w) + (log Z η -log Z η ′ ) = β |J ′ λ [η](w) -J ′ λ [η ′ ](w)| = β λ 2 f η (w) 2 -f η ′ (w) 2 ≤ β λ 2 (|f η | + |f η ′ |) (w) • |f η -f η ′ | (w) ≤ β λ 2 • 2 1 λ 2L 0 G(0) • HW 2 (η, η ′ ) =: HW 2 (η, η ′ )

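Formula formula_200: $\lambda^{-1}, G(0), L_0, L_1, B_1, L_2$. Now suppose that $\eta_{\lambda,\beta} = \arg\min J_{\lambda,\beta}$ satisfies $\alpha^*$-LSI. Let $\varepsilon > 0$ and $\eta_0$ in the $\delta$-sublevel set of $J_{\lambda,\beta}$, i.e., $\eta_0 \in S_\delta := J_{\lambda,\beta}^{-1}\left((-\infty, \inf J_{\lambda,\beta} + \delta]\right)$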

Formula formula_201: $\beta^{-1} H(\eta\,|\,\eta_{\lambda,\beta}) \le J_{\lambda,\beta}(\eta) - \inf J_{\lambda,\beta} \le \delta.$

Formula formula_202: $\forall\eta',\ W_2(\eta', \eta_{\lambda,\beta}) \le \sqrt{\frac{2}{\alpha^*} H(\eta'\,|\,\eta_{\lambda,\beta})}.$

Formula formula_203: log dη dη λ,β (w) + c ≤ HW 2 (η, η λ,β ) ≤ H 2 α * H (η|η λ,β ) ≤ H 2 α * • βδ =: M √ δ

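Formula formula_204: $J_\lambda(\eta) = \frac{\lambda}{2}\left\langle y, (K_\eta + \lambda\,\mathrm{id})^{-1}y\right\rangle_{\mathcal{H}}$, $J'_\lambda[\eta](w) = -\frac{\lambda}{2}\left\langle\phi(w), (K_\eta + \lambda\,\mathrm{id})^{-1}y\right\rangle_{\mathcal{H}}^2$, with $K_\eta = \int\phi(w)\phi(w)^*\,d\eta(w)$, where $*$ denotes adjoint in $\mathcal{H}$. More explicitly, $K_\eta$ is the integral operator of the kernel $k_\eta(x, x') = \int\varphi(\langle w, x\rangle)\varphi(\langle w, x'\rangle)\,d\eta(w)$ with respect to the distribution $x\sim\rho$, i.e., $\forall h\in\mathcal{H} = L^2_\rho(\mathbb{R}^{d+1})$, $(K_\eta h)(x) = \mathbb{E}_{x'\sim\rho}\left[k_\eta(x, x')h(x')\right]$ in $L^2_\rho$.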

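Formula formula_205: $G(\nu) = \frac{1}{2}\mathbb{E}_{x\sim\rho}\left|\int_{\mathcal{W}}\varphi(\langle w, x\rangle)\,d\nu(w) - y(x)\right|^2 = \frac{1}{2}\left\|\int_{\mathcal{W}}\phi(w)\,d\nu(w) - y\right\|_{\mathcal{H}}^2$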

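Formula formula_206: $\min_{f\in L^2_\eta(\mathcal{W})}\ \frac{1}{2}\left\|\int_{\mathcal{W}}\phi(w)f(w)\,d\eta(w) - y\right\|_{\mathcal{H}}^2 + \frac{\lambda}{2}\int_{\mathcal{W}}|f|^2(w)\,d\eta(w).$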

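Formula formula_207: $f(w) = -\frac{1}{\lambda}\left\langle\int\phi f\,d\eta - y,\ \phi(w)\right\rangle_{\mathcal{H}}$ in $L^2_\eta(\mathcal{W})$. In particular, denoting $\hat h_\eta = -\frac{1}{\lambda}\left(\int\phi f_\eta\,d\eta - y\right)$, then $f_\eta(w) = \langle\hat h_\eta, \phi(w)\rangle_{\mathcal{H}}$ and, integrating against $\phi\,\eta$, $\int_{\mathcal{W}} f_\eta(w)\phi(w)\,d\eta(w) = \int_{\mathcal{W}}\phi(w)\phi(w)^*\hat h_\eta\,d\eta(w) \iff -\lambda\hat h_\eta + y = K_\eta\hat h_\eta \iff (K_\eta + \lambda\,\mathrm{id})\hat h_\eta = y \iff \hat h_\eta = (K_\eta + \lambda\,\mathrm{id})^{-1}y,$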

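Formula formula_208: $J_\lambda(\eta) = \frac{1}{2}\left\|\int_{\mathcal{W}}\phi(w)f_\eta(w)\,d\eta(w) - y\right\|_{\mathcal{H}}^2 + \frac{\lambda}{2}\int_{\mathcal{W}}|f_\eta|^2(w)\,d\eta(w)$ (F.1) $= \frac{1}{2}\left\|\lambda\hat h_\eta\right\|_{\mathcal{H}}^2 + \frac{\lambda}{2}\int_{\mathcal{W}}\hat h_\eta^*\phi(w)\phi(w)^*\hat h_\eta\,d\eta(w) = \frac{1}{2}\langle\lambda\hat h_\eta, \lambda\hat h_\eta\rangle_{\mathcal{H}} + \frac{\lambda}{2}\langle\hat h_\eta, K_\eta\hat h_\eta\rangle_{\mathcal{H}} = \frac{1}{2}\langle\lambda\hat h_\eta, \lambda\hat h_\eta + K_\eta\hat h_\eta\rangle_{\mathcal{H}} = \frac{1}{2}\langle\lambda\hat h_\eta, y\rangle_{\mathcal{H}} = \frac{\lambda}{2}\left\langle y, (K_\eta + \lambda\,\mathrm{id})^{-1}y\right\rangle_{\mathcal{H}}.$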
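
The chain of equalities above can be checked on a toy instance. The sketch below is an illustration under simplifying assumptions that are not in the original: the Hilbert space $\mathcal{H}$ is replaced by $\mathbb{R}^2$, $\eta$ is a finite mixture of Dirac masses with positive weights, and all function names and numerical values are hypothetical. It evaluates the inner ridge objective at $f_\eta(w_i) = \langle\hat h_\eta, \phi(w_i)\rangle$ and compares it with $\frac{\lambda}{2}\langle y, (K_\eta + \lambda\,\mathrm{id})^{-1}y\rangle$.

```python
def solve_2x2(a11, a12, a21, a22, b1, b2):
    """Solve a 2x2 linear system by Cramer's rule."""
    det = a11 * a22 - a12 * a21
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det)


def outer_objective_closed_form(features, weights, y, lam):
    """Return (J_lambda(eta), f_eta) for eta = sum_i weights[i] * delta_{w_i}.

    features[i] plays the role of phi(w_i); here it is a vector in R^2 (toy assumption).
    """
    # K_eta = sum_i p_i * phi_i phi_i^T, a symmetric 2x2 matrix.
    k11 = sum(p * phi[0] * phi[0] for p, phi in zip(weights, features))
    k12 = sum(p * phi[0] * phi[1] for p, phi in zip(weights, features))
    k22 = sum(p * phi[1] * phi[1] for p, phi in zip(weights, features))
    # h_hat = (K_eta + lam * id)^{-1} y
    h = solve_2x2(k11 + lam, k12, k12, k22 + lam, y[0], y[1])
    f_eta = [h[0] * phi[0] + h[1] * phi[1] for phi in features]
    value = 0.5 * lam * (y[0] * h[0] + y[1] * h[1])
    return value, f_eta


def inner_objective(features, weights, y, lam, f):
    """0.5 * |sum_i p_i f_i phi_i - y|^2 + 0.5 * lam * sum_i p_i f_i^2."""
    r0 = sum(p * fi * phi[0] for p, fi, phi in zip(weights, f, features)) - y[0]
    r1 = sum(p * fi * phi[1] for p, fi, phi in zip(weights, f, features)) - y[1]
    penalty = sum(p * fi * fi for p, fi in zip(weights, f))
    return 0.5 * (r0 * r0 + r1 * r1) + 0.5 * lam * penalty


if __name__ == "__main__":
    feats = [(1.0, 0.2), (-0.3, 0.8), (0.5, -0.5)]
    probs = [0.2, 0.5, 0.3]
    target, lam = (0.7, -0.4), 0.1
    value, f_eta = outer_objective_closed_form(feats, probs, target, lam)
    print(value, inner_objective(feats, probs, target, lam, f_eta))
```

Because the inner objective is strictly convex in $f$ when $\lambda > 0$ and all weights are positive, any perturbation of `f_eta` should give a strictly larger value than the closed form.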

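Formula formula_209: $f_\eta\in L^2_\eta(\mathcal{W})$ into a function $\mathcal{W}\to\mathbb{R}$), we then have $\forall w\in\mathcal{W}$, $J'_\lambda[\eta](w) = \left\langle\int\phi f_\eta\,d\eta - y,\ \phi(w)f_\eta(w)\right\rangle_{\mathcal{H}} + \frac{\lambda}{2}|f_\eta|^2(w) = f_\eta(w)\left\langle-\lambda\hat h_\eta, \phi(w)\right\rangle_{\mathcal{H}} + \frac{\lambda}{2}|f_\eta|^2(w) = -\lambda|f_\eta|^2(w) + \frac{\lambda}{2}|f_\eta|^2(w) = -\frac{\lambda}{2}|f_\eta|^2(w) = -\frac{\lambda}{2}\langle\hat h_\eta, \phi(w)\rangle_{\mathcal{H}}^2$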

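Formula formula_210: $L^2_\rho(\mathbb{R}^{d+1})$ of the kernel $k_\eta(x, x') = \int_{\mathcal{W}}\phi(w)(x)\,\phi(w)(x')\,d\eta(w)$ follows directly from the definition $K_\eta = \int_{\mathcal{W}}\phi(w)\phi(w)^*\,d\eta(w)$, since $\forall h\in\mathcal{H}$, $K_\eta h = \int_{\mathcal{W}}\phi(w)\langle\phi(w), h\rangle_{\mathcal{H}}\,d\eta(w)$, $(K_\eta h)(x) = \int_{\mathcal{W}}\phi(w)(x)\,\mathbb{E}_{x'\sim\rho}\left[\phi(w)(x')h(x')\right]d\eta(w) = \mathbb{E}_{x'\sim\rho}\left[\int_{\mathcal{W}}\phi(w)(x)\phi(w)(x')\,h(x')\,d\eta(w)\right] = \mathbb{E}_{x'\sim\rho}\left[k_\eta(x, x')h(x')\right].$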

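Formula formula_211: $|J_\lambda(\eta) - J_\lambda(\eta')| \le \frac{B_0B_1}{\lambda}\|y\|_{\mathcal{H}}^2\cdot W_1(\eta, \eta')$ and $|J'_\lambda[\eta](w) - J'_\lambda[\eta'](w)| \le \frac{2B_0^3B_1}{\lambda^2}\|y\|_{\mathcal{H}}^2\cdot W_1(\eta, \eta')$ and $\|\nabla J'_\lambda[\eta](w) - \nabla J'_\lambda[\eta'](w)\|_w \le \frac{4B_0^2B_1^2}{\lambda^2}\|y\|_{\mathcal{H}}^2\cdot W_1(\eta, \eta')$ and $\left\|\nabla^2J'_\lambda[\eta](w) - \nabla^2J'_\lambda[\eta'](w)\right\|_{\mathrm{op},w} \le \frac{4B_0B_1(B_0B_2 + B_1^2)}{\lambda^2}\|y\|_{\mathcal{H}}^2\cdot W_1(\eta, \eta').$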

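Formula formula_212: $J'_\lambda[\eta](w) = -\frac{\lambda}{2}\left\langle\phi(w), (K_\eta + \lambda\,\mathrm{id})^{-1}y\right\rangle_{\mathcal{H}}^2$ where $K_\eta = \int_{\mathcal{W}}\phi(w'')\phi(w'')^*\,d\eta(w'')$ so $\nabla J'_\lambda[\eta](w) = -\lambda\left\langle\phi(w), (K_\eta + \lambda\,\mathrm{id})^{-1}y\right\rangle_{\mathcal{H}}\left\langle\nabla\phi(w), (K_\eta + \lambda\,\mathrm{id})^{-1}y\right\rangle_{\mathcal{H}}$ (F.2) $\|\nabla J'_\lambda[\eta]\|_w \le \lambda\|\phi(w)\|_{\mathcal{H}}\|\nabla\phi(w)\|_w\left\|(K_\eta + \lambda\,\mathrm{id})^{-1}y\right\|_{\mathcal{H}}^2 \le \lambda B_0B_1\|y\|_{\mathcal{H}}^2\left\|(K_\eta + \lambda)^{-1}\right\|_{\mathrm{op}}^2 \le \frac{1}{\lambda}B_0B_1\|y\|_{\mathcal{H}}^2$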

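Formula formula_213: $\left\|(K_\eta + \lambda)^{-1}\right\|_{\mathrm{op}} = \sigma_{\max}\left((K_\eta + \lambda\,\mathrm{id})^{-1}\right) = \left[\sigma_{\min}(K_\eta + \lambda\,\mathrm{id})\right]^{-1}$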

Formula formula_214: $\frac{\delta}{\delta\eta(w')}(K_\eta + \lambda\,\mathrm{id})^{-1} = -(K_\eta + \lambda\,\mathrm{id})^{-1}\cdot\phi(w')\phi(w')^*\cdot(K_\eta + \lambda\,\mathrm{id})^{-1},$

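Formula formula_215: $J''_\lambda[\eta](w, w') = -\lambda\left\langle\phi(w), (K_\eta + \lambda\,\mathrm{id})^{-1}y\right\rangle_{\mathcal{H}}\left\langle\phi(w), -(K_\eta + \lambda\,\mathrm{id})^{-1}\cdot\phi(w')\phi(w')^*\cdot(K_\eta + \lambda\,\mathrm{id})^{-1}y\right\rangle_{\mathcal{H}} = -\lambda\langle\phi(w), My\rangle_{\mathcal{H}}\langle\phi(w), -M\cdot\phi(w')\phi(w')^*\cdot My\rangle_{\mathcal{H}} = \lambda\langle\phi(w), My\rangle_{\mathcal{H}}\langle\phi(w), M\phi(w')\rangle_{\mathcal{H}}\langle\phi(w'), My\rangle_{\mathcal{H}}.$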

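Formula formula_216: $\nabla_wJ''_\lambda[\eta](w, w') = \lambda\langle\phi(w'), My\rangle_{\mathcal{H}}\cdot\left(\langle\nabla\phi(w), My\rangle\cdot\langle\phi(w), M\phi(w')\rangle_{\mathcal{H}} + \langle\phi(w), My\rangle\cdot\langle\nabla\phi(w), M\phi(w')\rangle_{\mathcal{H}}\right)$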

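Formula formula_217: $\|M\|_{\mathrm{op}} = \left\|(K_\eta + \lambda)^{-1}\right\|_{\mathrm{op}} \le \lambda^{-1}$, $\|\nabla_wJ''_\lambda(w, w')\|_w \le \lambda B_0\lambda^{-1}\|y\|_{\mathcal{H}}\cdot 2B_0^2B_1\lambda^{-2}\|y\|_{\mathcal{H}} = 2\lambda^{-2}B_0^3B_1\|y\|_{\mathcal{H}}^2.$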

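Formula formula_218: $\nabla_wJ''_\lambda[\eta](w, w')$ derived above, $\nabla_{w'}\nabla_wJ''_\lambda[\eta](w, w') = \lambda\langle\nabla\phi(w'), My\rangle_{\mathcal{H}}\cdot\left(\langle\nabla\phi(w), My\rangle\cdot\langle\phi(w), M\phi(w')\rangle_{\mathcal{H}} + \langle\phi(w), My\rangle\cdot\langle\nabla\phi(w), M\phi(w')\rangle_{\mathcal{H}}\right) + \lambda\langle\phi(w'), My\rangle_{\mathcal{H}}\cdot\left(\langle\nabla\phi(w), My\rangle\cdot\langle\phi(w), M\nabla\phi(w')\rangle_{\mathcal{H}} + \langle\phi(w), My\rangle\cdot\langle\nabla\phi(w), M\nabla\phi(w')\rangle_{\mathcal{H}}\right)$, so $\|\nabla_{w'}\nabla_wJ''_\lambda[\eta](w, w')\| \le 4\lambda^{-2}B_0^2B_1^2\|y\|_{\mathcal{H}}^2$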

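Formula formula_219: $\nabla_{w'}\nabla_w^2J''_\lambda[\eta](w, w') = \lambda\langle\nabla\phi(w'), My\rangle_{\mathcal{H}}\cdot\left(\langle\nabla^2\phi(w), My\rangle\cdot\langle\phi(w), M\phi(w')\rangle_{\mathcal{H}} + \langle\nabla\phi(w), My\rangle\cdot\langle\nabla\phi(w), M\phi(w')\rangle_{\mathcal{H}}\right) + \lambda\langle\nabla\phi(w'), My\rangle_{\mathcal{H}}\cdot\left(\langle\nabla\phi(w), My\rangle\cdot\langle\nabla\phi(w), M\phi(w')\rangle_{\mathcal{H}} + \langle\phi(w), My\rangle\cdot\langle\nabla^2\phi(w), M\phi(w')\rangle_{\mathcal{H}}\right) + \lambda\langle\phi(w'), My\rangle_{\mathcal{H}}\cdot\left(\langle\nabla^2\phi(w), My\rangle\cdot\langle\phi(w), M\nabla\phi(w')\rangle_{\mathcal{H}} + \langle\nabla\phi(w), My\rangle\cdot\langle\nabla\phi(w), M\nabla\phi(w')\rangle_{\mathcal{H}}\right) + \lambda\langle\phi(w'), My\rangle_{\mathcal{H}}\cdot\left(\langle\nabla\phi(w), My\rangle\cdot\langle\nabla\phi(w), M\nabla\phi(w')\rangle_{\mathcal{H}} + \langle\phi(w), My\rangle\cdot\langle\nabla^2\phi(w), M\nabla\phi(w')\rangle_{\mathcal{H}}\right)$, hence $\left\|\nabla_{w'}\nabla_w^2J''_\lambda[\eta](w, w')\right\| \le \lambda^{-2}\|y\|^2B_0B_1(4B_2B_0 + 4B_1^2)$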

Formula formula_220: ∇ 2 J ′ λ [η](w) • s w for s ∈ T w W arbitrary.

Formula formula_221: W → H = L 2 ρ (R d+1

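Formula formula_222: $\sup_{w\in\mathbb{S}^d}\|\phi(w)\|_{\mathcal{H}} \le \|\varphi\|_{L^2(\rho)}$, $\sup_{w\in\mathbb{S}^d}\|\nabla\phi(w)\|_{\mathcal{H}} \le \|\varphi'\|_{L^4(\rho)}N_4(\rho)$, $\sup_{w\in\mathbb{S}^d}\left\|\nabla^2\phi(w)\right\|_{\mathcal{H}} \le \left(\|\varphi''\|_{L^4(\rho)} + \|\varphi'\|_{L^4(\rho)}\right)N_4(\rho)$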

Formula formula_223: $N_4(\rho) := \sup_{\|u\|_2\le 1}\left(\mathbb{E}_{x\sim\rho}\langle u, x\rangle^4\right)^{1/4}$

Formula formula_224: w∈S d (E x∼ρ |f (⟨w, x⟩)| p ) 1/p .

Formula formula_225: universal constant $c$ such that $N_4(\rho) \le c\,d^{-1/2}\left(\mathbb{E}_{x\sim\rho}\|x\|^4\right)^{1/4}$.

Formula formula_226: $\sup_w\|\phi(w)\|_{\mathcal{H}} = \sup_w\left(\mathbb{E}_{x\sim\rho}|\varphi(\langle w, x\rangle)|^2\right)^{1/2} = \|\varphi\|_{L^2(\rho)}.$

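Formula formula_227: $\|\nabla\phi(w)\|_{\mathcal{H}} = \sup_{\|f\|_{L^2(\rho)}\le 1}\ \sup_{s\in T_w\mathbb{S}^d,\ \|s\|_w=1}\mathbb{E}_{x\sim\rho}\left[f(x)\langle s, \nabla\phi(w)(x)\rangle_w\right] = \sup_{s\in T_w\mathbb{S}^d,\ \|s\|_w=1}\left(\mathbb{E}_{x\sim\rho}\langle s, \nabla\phi(w)(x)\rangle_w^2\right)^{1/2} = \sup_{s\in T_w\mathbb{S}^d,\ \|s\|_w=1}\left(\mathbb{E}_{x\sim\rho}|\varphi'(\langle w, x\rangle)|^2\langle\Pi_ws, x\rangle^2\right)^{1/2} \le \left(\mathbb{E}_{x\sim\rho}|\varphi'(\langle w, x\rangle)|^4\right)^{1/4}$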

Formula formula_228: ∥u∥ 2 =1 E x∼ρ ⟨u, x⟩ 4 1/4

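Formula formula_229: $\left[\nabla^2\phi(w)\right](x) = \nabla_w^2\varphi(\langle w, x\rangle) = \nabla_w^\top\left[\varphi'(\langle w, x\rangle)\Pi_wx\right] = \Pi_w\varphi''(\langle w, x\rangle)xx^\top - \varphi'(\langle w, x\rangle)\langle w, x\rangle\Pi_w,$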

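Formula formula_230: $\left\|\nabla^2\phi(w)\right\|_{\mathcal{H}} \le \sup_{s\in T_w\mathbb{S}^d,\ \|s\|_w=1}\left(\mathbb{E}_{x\sim\rho}|\varphi''(\langle w, x\rangle)|^2\langle s, \Pi_wx\rangle^2\right)^{1/2} + \left(\mathbb{E}_{x\sim\rho}|\varphi'(\langle w, x\rangle)|^2\langle w, x\rangle^2\right)^{1/2} \le \left(\mathbb{E}_{x\sim\rho}|\varphi''(\langle w, x\rangle)|^4\right)^{1/4}$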

Formula formula_231: s∈TwS d ∥s∥ w =1 E x∼ρ ⟨Π w s, x⟩ 4 1/4 + E x∼ρ |φ ′ (⟨w, x⟩)| 4 1/4 E x∼ρ ⟨w, x⟩4 1/4

Formula formula_232: ) ≤ cd -1/2 E x∼ρ ∥x∥ 4 1/4

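Formula formula_233: $N_4^4(\rho) = \sup_{\|u\|_2\le 1}\mathbb{E}_{x\sim\rho}\left[\|x\|^4\langle u, x/\|x\|\rangle^4\right] = \sup_{\|u\|_2\le 1}\mathbb{E}_{x\sim\rho}\|x\|^4\cdot\mathbb{E}_{x\sim\tau}\langle u, x\rangle^4,$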

Formula formula_234: sup ∥u∥ 2 ≤1 E x∼τ ⟨u, x⟩ 4 ≤ c(

Formula formula_235: ′ ∈ S d , G ′ [ν](w) = ϕ(w), W ϕ(w ′ )dν(w ′ ) -y H G ′′ [ν](w, w ′ ) = ⟨ϕ(w), ϕ(w ′ )⟩ H and ∇ w G ′′ [ν](w, w ′ ) = ⟨∇ϕ(w), ϕ(w ′ )⟩ H ∇ 2 w G ′′ [ν](w, w ′ ) = ∇ 2 ϕ(w), ϕ(w ′ ) H ∇ w ∇ w ′ G ′′ [ν](w, w ′ ) = ⟨∇ϕ(w), ∇ϕ(w ′ )⟩ H .

Formula formula_236: $|G''[\nu](w, w')| \le C_0^2 =: L_0$, $\|\nabla_wG''[\nu](w, w')\|_w \le C_0C_1 =: L_1$, $\left\|\nabla_w^2G''[\nu](w, w')\right\|_w \le C_0C_2 =: L_2$, $\|\nabla_w\nabla_{w'}G''[\nu](w, w')\| \le C_1^2 =: L_2$. Now for each $i\in\{0, 1, 2\}$, $\forall(\nu, w, w'),\ \left\|\nabla_w^iG''[\nu](w, w')\right\|_w \le L_i \implies \forall(\nu, \nu', w),\ \left\|\nabla^iG'[\nu] - \nabla^iG'[\nu']\right\|_w \le L_i\|\nu - \nu'\|_{TV}.$

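Formula formula_237: $g(\theta) = \left\langle s, \nabla^iG'[\nu + \theta(\nu' - \nu)](w)\right\rangle_w$ over $\theta\in[0, 1]$ for each $s\in(T_w\mathcal{W})^{\otimes i}$. Thus, to show the existence of $B_i < \infty$ such that $\forall(\nu, w, w'),\ \left\|\nabla^iG'[\nu]\right\|_w \le L_i\|\nu\|_{TV} + B_i,$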

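Formula formula_238: $\nabla^iG'[\nu](w) = \left\langle\nabla^i\phi(w), \int_{\mathcal{W}}\phi(w')\,d\nu(w') - y\right\rangle_{\mathcal{H}}$, thus $\nabla^iG'[0](w) = -\left\langle\nabla^i\phi(w), y\right\rangle_{\mathcal{H}}$ and $\sup_w\left\|\nabla^iG'[0](w)\right\|_w \le C_i\|y\|_{\mathcal{H}} < \infty.$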

Formula formula_239: $J'_\lambda[\delta_v]$ as a proxy of $J'_\lambda[\eta_{\lambda,\beta}]$,

Formula formula_240: $J'_\lambda[\delta_v]$ over to $J'_\lambda[\eta_{\lambda,\beta}]$,

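Formula formula_241: $\forall w\in\mathbb{S}^d,\ J'_\lambda[\delta_v](w) = -\frac{\lambda}{2}\left(\lambda + \|\phi(v)\|_{\mathcal{H}}^2\right)^{-2}\langle\phi(v), \phi(w)\rangle_{\mathcal{H}}^2 = -\frac{\lambda}{2}\left(\lambda + \|\varphi\|_{L^2(\rho)}^2\right)^{-2}\left|\mathbb{E}_{x\sim\rho}\varphi(\langle x, v\rangle)\varphi(\langle x, w\rangle)\right|^2 = -\lambda g(\langle v, w\rangle)$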

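Formula formula_242: $J'_\lambda[\delta_v] = -\frac{\lambda}{2}\left\langle\phi(w), (K_{\delta_v} + \lambda\,\mathrm{id})^{-1}\phi(v)\right\rangle_{\mathcal{H}}^2$. Since $\phi(v)$ is an eigenvector of $K_{\delta_v} = \int_{\mathcal{W}}\phi(w')\phi(w')^*\,d\delta_v = \phi(v)\phi(v)^*$ with eigenvalue $\|\phi(v)\|_{\mathcal{H}}^2 = \mathbb{E}_{x\sim\rho}\varphi(\langle x, v\rangle)^2 = \|\varphi\|_{L^2(\rho)}^2$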
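
The eigenvector argument above says that the resolvent acts on $\phi(v)$ as multiplication by $(\lambda + \|\phi(v)\|^2)^{-1}$. A minimal numerical sketch follows, assuming a finite-dimensional stand-in for $\mathcal{H}$ (plain lists of floats); the function names and the sample vectors are hypothetical and only serve as illustration.

```python
def apply_rank_one_plus_ridge(v, lam, x):
    """Return (v v^T + lam * I) x for vectors given as lists of floats."""
    vx = sum(a * b for a, b in zip(v, x))
    return [a * vx + lam * b for a, b in zip(v, x)]


def resolvent_on_v(v, lam):
    """Return (v v^T + lam * I)^{-1} v, using that v is an eigenvector of v v^T."""
    scale = 1.0 / (lam + sum(a * a for a in v))
    return [scale * a for a in v]


def first_variation_at_dirac(phi_v, phi_w, lam):
    """-(lam/2) * <phi(w), (K + lam id)^{-1} phi(v)>^2 with K = phi(v) phi(v)^T."""
    h = resolvent_on_v(phi_v, lam)
    inner = sum(a * b for a, b in zip(phi_w, h))
    return -0.5 * lam * inner * inner


if __name__ == "__main__":
    v, lam = [0.3, -1.2, 0.5], 0.25
    # Applying the operator to the claimed resolvent should give back v.
    print(apply_rank_one_plus_ridge(v, lam, resolvent_on_v(v, lam)))
```

Plugging the scaled vector into the first variation recovers the factor $(\lambda + \|\phi(v)\|^2)^{-2}\langle\phi(v), \phi(w)\rangle^2$ that appears in the closed form for $J'_\lambda[\delta_v]$.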

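Formula formula_243: $\beta \ge D_0d\lambda^{-1}$ then $\forall w\in\mathbb{S}^d,\ \frac{1}{2}\Delta f - \frac{\beta}{4}\|\nabla f\|^2 \le D_1\lambda d$ $(L_{\mathbb{S}^d})$; $\forall w\in\mathbb{S}^d\setminus U,\ \frac{1}{2}\Delta f - \frac{\beta}{4}\|\nabla f\|^2 \le -D_2\beta\lambda^2$ $(L_U)$; $\forall w\in\mathbb{S}^d,\ \lambda_{\min}(\nabla^2f(w)) \ge -D_3\lambda$ $(C_{\mathbb{S}^d})$; $\forall w\in U,\ \lambda_{\min}(\nabla^2f(w)) \ge D_4\lambda$ $(C_U)$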

Formula formula_244: $\forall w\in\mathbb{S}^d,\ \frac{1}{2}\Delta f - \frac{\beta}{4}\|\nabla f\|^2 \le D'_1\lambda d\beta^{3/4},$ $(L'_{\mathbb{S}^d})$

Formula formula_245: κ ≥ D 2 β 2 λ 2 1 + D1λβd+D2β 2 λ 2 d-1+βλD4

Formula formula_246: $f_0 := J'_\lambda[\delta_v]$

Formula formula_247: $\|\nabla f_0(w)\|^2 = \lambda^2g'(\langle w, v\rangle)^2\left(1 - \langle w, v\rangle^2\right)$

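Formula formula_248: $\frac{1}{2}\Delta f_0 - \frac{\beta}{4}\|\nabla f_0\|^2 = -\frac{\lambda}{4}\left[2g''(\langle w, v\rangle) + \beta\lambda g'(\langle w, v\rangle)^2\right]\left(1 - \langle w, v\rangle^2\right) + \frac{\lambda}{2}g'(\langle w, v\rangle)\langle w, v\rangle d.$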
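
The identity above can be recovered from the standard formulas for a zonal function on the sphere. The worked computation below is a sketch and not taken from the original text: $h$ denotes an arbitrary $C^2$ function on $[-1, 1]$ (a placeholder symbol), $t = \langle w, v\rangle$ with $\|v\| = 1$, and $\Pi_w$ is the orthogonal projection onto $T_w\mathbb{S}^d$, whose trace is $d$.

```latex
% Gradient, Hessian and Laplacian of w -> h(<w,v>) on S^d, with t = <w,v>.
\begin{align*}
\nabla\,[h(\langle w, v\rangle)] &= h'(t)\,\Pi_w v, &
\|\Pi_w v\|^2 &= 1 - t^2,\\
\nabla^2\,[h(\langle w, v\rangle)] &= h''(t)\,\Pi_w v v^\top \Pi_w - h'(t)\,t\,\Pi_w, &
\Delta\,[h(\langle w, v\rangle)] &= h''(t)(1 - t^2) - d\,t\,h'(t).
\end{align*}
% Specializing to h = -\lambda g, i.e. f_0(w) = -\lambda g(<w,v>):
\begin{align*}
\tfrac{1}{2}\Delta f_0 - \tfrac{\beta}{4}\|\nabla f_0\|^2
&= -\tfrac{\lambda}{2}\bigl[g''(t)(1 - t^2) - d\,t\,g'(t)\bigr] - \tfrac{\beta\lambda^2}{4}\,g'(t)^2(1 - t^2)\\
&= -\tfrac{\lambda}{4}\bigl[2g''(t) + \beta\lambda\,g'(t)^2\bigr](1 - t^2) + \tfrac{\lambda}{2}\,g'(t)\,t\,d.
\end{align*}
```

The Laplacian is the trace of the Hessian: the rank-one term contributes $\|\Pi_wv\|^2 = 1 - t^2$ and the projection term contributes $\mathrm{Tr}(\Pi_w) = d$.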

Formula formula_249: $\inf_{[-1,1]}\left[2g'' + \beta\lambda(g')^2\right] \ge 0 \impliedby 2(\inf g'') + \beta\lambda(\inf g')^2 \ge 0 \impliedby -2C_2 + \beta\lambda c_1^2 \ge 0 \iff \beta \ge \frac{2C_2}{c_1^2}\lambda^{-1}.$

Formula formula_250: $\beta \ge \frac{4C_2}{c_1^2\lambda} \implies \inf_{[-1,1]}\left[2g'' + \frac{\beta}{2}\lambda(g')^2\right] \ge 0 \implies 2g'' + \beta\lambda(g')^2 \ge \frac{\beta}{2}\lambda(g')^2$ over $[-1, 1]$.

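Formula formula_251: $\frac{1}{2}\Delta f_0 - \frac{\beta}{4}\|\nabla f_0\|^2 \le -\frac{\lambda}{4}\cdot\frac{1}{2}\beta\lambda g'(\langle w, v\rangle)^2\left(1 - \langle w, v\rangle^2\right) + \frac{\lambda}{2}g'(\langle w, v\rangle)\langle w, v\rangle d = \frac{\lambda}{4}g'(\langle w, v\rangle)\left[-\frac{\beta\lambda}{2}g'(\langle w, v\rangle)\left(1 - \langle w, v\rangle^2\right) + 2\langle w, v\rangle d\right] \le \frac{\lambda}{4}g'(\langle w, v\rangle)\left[-\frac{2\beta\lambda c_1r^2}{\pi^2} + 2\langle w, v\rangle d\right] \le -\frac{\lambda}{4}g'(\langle w, v\rangle)\cdot\frac{\beta\lambda c_1r^2}{\pi^2} \le -\frac{c_1^2}{4\pi^2}\beta\lambda^2r^2$ provided that $\beta \ge \frac{2\pi^2d}{\lambda c_1r^2}$.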

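Formula formula_252: $\left\|\Pi_wvv^\top\Pi_w\right\|_{\mathrm{op}} = \|\Pi_wv\|^2 = 1 - \langle w, v\rangle^2$, $\forall w,\ \left\|\nabla^2f_0(w)\right\|_{\mathrm{op}} \le \lambda g''(\langle w, v\rangle)\left(1 - \langle w, v\rangle^2\right) + \lambda C_1 \le \lambda\sup_{s\in[-1,1]}g''(s)(1 - s^2) + \lambda C_1 \le (C_3 + C_1)\lambda,$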

Formula formula_253: w∈S d λ min (∇ 2 f 0 (w)) ≥ -sup w ∇ 2 f 0 (w) op ≥ -(C 3 + C 1 )λ.

Formula formula_254: $\lambda_{\min}(\nabla^2f_0(w)) \ge -\lambda\sup_{\cos r\le s\le 1}|g''(s)|(1 - s^2) + \lambda c_1\cos r \ge -\lambda C_3\sup_{\cos r\le s\le 1}\sqrt{1 - s^2} + \lambda c_1\cos r = \lambda\left(-C_3\sin r + c_1\cos r\right) \ge \lambda\frac{c_1}{2}$

Formula formula_255: $J'_\lambda[\eta_{\lambda,\beta}]$ instead of $J'_\lambda[\delta_v]$.

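Formula formula_256: $\forall w\in\mathbb{S}^d,\ \left|\frac{1}{2}\Delta J'_\lambda[\eta] - \frac{\beta}{4}\|\nabla J'_\lambda[\eta]\|^2 - \frac{1}{2}\Delta J'_\lambda[\eta'] + \frac{\beta}{4}\|\nabla J'_\lambda[\eta']\|^2\right| \le \left(d\,\frac{2B_0B_1(B_0B_2 + B_1^2)}{\lambda^2}\|y\|_{\mathcal{H}}^2 + \beta\,\frac{2B_0^3B_1^3}{\lambda^3}\|y\|_{\mathcal{H}}^4\right)W_1(\eta, \eta')$ and $\left|\lambda_{\min}(\nabla^2J'_\lambda[\eta]) - \lambda_{\min}(\nabla^2J'_\lambda[\eta'])\right| \le \frac{4B_0B_1(B_0B_2 + B_1^2)}{\lambda^2}\|y\|_{\mathcal{H}}^2W_1(\eta, \eta').$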

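Formula formula_257: $\left\|\nabla^2J'_\lambda[\eta](w) - \nabla^2J'_\lambda[\eta'](w)\right\|_{\mathrm{op}} \le \frac{4B_0B_1(B_0B_2 + B_1^2)}{\lambda^2}\|y\|_{\mathcal{H}}^2W_1(\eta, \eta')$ and $\left|\lambda_{\min}(\nabla^2J'_\lambda[\eta](w)) - \lambda_{\min}(\nabla^2J'_\lambda[\eta'](w))\right| \le \left\|\nabla^2J'_\lambda[\eta](w) - \nabla^2J'_\lambda[\eta'](w)\right\|_{\mathrm{op}}$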

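Formula formula_258: $\Delta J'_\lambda[\eta](w) = \mathrm{Tr}\,\nabla^2J'_\lambda[\eta](w)$ and so $\left|\frac{1}{2}\Delta J'_\lambda[\eta] - \frac{1}{2}\Delta J'_\lambda[\eta']\right| \le \frac{d}{2}\left\|\nabla^2J'_\lambda[\eta](w) - \nabla^2J'_\lambda[\eta'](w)\right\|_{\mathrm{op}} \le \frac{d}{2}\cdot\frac{4B_0B_1(B_0B_2 + B_1^2)}{\lambda^2}\|y\|_{\mathcal{H}}^2W_1(\eta, \eta').$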

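Formula formula_259: $\|\nabla J'_\lambda[\eta]\| \le \frac{B_0B_1}{\lambda}\|y\|_{\mathcal{H}}^2$ and $\|\nabla J'_\lambda[\eta] - \nabla J'_\lambda[\eta']\| \le \frac{4B_0^2B_1^2}{\lambda^2}\|y\|_{\mathcal{H}}^2W_1(\eta, \eta')$, so $\left|\frac{\beta}{4}\|\nabla J'_\lambda[\eta]\|^2 - \frac{\beta}{4}\|\nabla J'_\lambda[\eta']\|^2\right| \le \frac{\beta}{4}\cdot 2\,\frac{B_0B_1\|y\|_{\mathcal{H}}^2}{\lambda}\cdot\frac{4B_0^2B_1^2\|y\|_{\mathcal{H}}^2}{\lambda^2}W_1(\eta, \eta') = \beta\,\frac{2B_0^3B_1^3\|y\|_{\mathcal{H}}^4}{\lambda^3}W_1(\eta, \eta'),$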

Formula formula_260: L 2 (ρ) , ∥φ ′ ∥ L 4 (ρ) , ∥φ ′′ ∥ L 4 (ρ) , E x∼ρ ∥x∥ 4 /d 2 , c 1 , C 1 , C 2 , C 3 and C 4 .

Formula formula_261: * := J ′ λ [η λ,β ] satisfies the conditions (L ′ S d ) (L U ) (C S d ) (C U ) of Thm. F.6 with some constants D ′ 0 , D ′ 1 , D 2 , D 3 , D 4 , r = Θ(1). By Lem. F.3, there exist constants B i = O(1) such that sup w ∇ i ϕ(w) H ≤ B i , for i ∈ {0, 1, 2}.

Formula formula_262: $W_2(\eta_{\lambda,\beta}, \delta_v) \lesssim \sqrt{\beta^{-1}d\lambda^{-1}\cdot\log(\beta d^{-1}\lambda^{-1})} =: W.$

Formula formula_263: S d ) (L U ) (C S d ) (C U ) for f = f 0 and D i = Θ(

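Formula formula_264: $\forall w\in\mathbb{S}^d,\ \frac{1}{2}\Delta f_* - \frac{\beta}{4}\|\nabla f_*\|^2 \lesssim \lambda d + (d\lambda^{-2} + \beta\lambda^{-3})W$; $\forall w\in\mathbb{S}^d\setminus U,\ \frac{1}{2}\Delta f_* - \frac{\beta}{4}\|\nabla f_*\|^2 \le -D_2\beta\lambda^2 + E_2\cdot(d\lambda^{-2} + \beta\lambda^{-3})W$; $\forall w\in\mathbb{S}^d,\ \lambda_{\min}(\nabla^2f_*(w)) \gtrsim -\lambda - \lambda^{-2}W$; $\forall w\in U,\ \lambda_{\min}(\nabla^2f_*(w)) \ge D_4\lambda - E_4\cdot\lambda^{-2}W$ for some constants $E_2, E_4 = O(1)$. So, • $(L'_{\mathbb{S}^d})$ for $f_*$ can be ensured with $D'_1 = O(1)$ provided that $(d\lambda^{-2} + \beta\lambda^{-3})W = (\beta^{-1}d\lambda + 1)\beta\lambda^{-3}W = O(\lambda d\beta^{3/4}$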

Formula formula_265: U ) can be ensured with D 2 = D2 2 if β is such that E 2 (dλ -2 + βλ -3 )W ≤ D2 2 βλ 2 , i.e., (β -1 dλ + 1)λ -5 W ≤ D2

Formula formula_266: D 4 = D4 4 if E 4 λ -2 W ≤ D4 2 λ, i.e., λ -3 W ≤ D4 2E4 =: F 4 = Θ(1).

Formula formula_267: $\beta^{1/4}d^{-1}\lambda^{-4}W \le F_2 \iff \beta^{1/2}d^{-2}\lambda^{-8}\cdot\beta^{-1}d\lambda^{-1}\log\frac{\beta}{d\lambda} = \beta^{-1/2}\lambda^{-9}d^{-1}\log\frac{\beta}{d\lambda} \le F_2^2.$

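Formula formula_268: $\beta^{-1/2}\lambda^{-9}d^{-1}\left(\frac{\beta}{d\lambda}\right)^{\varepsilon} \le \varepsilon F_2^2 \iff \beta^{1/2-\varepsilon} \ge \varepsilon^{-1}F_2^{-2}\lambda^{-9-\varepsilon}d^{-1-\varepsilon}.$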
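
One way to read the condition above is as a sufficient condition for the logarithmic inequality that precedes it, via the elementary bound $\log x \le x^{\varepsilon}/\varepsilon$ for $x > 0$ and $\varepsilon > 0$ (this reading is an assumption about the intended argument). The sketch below computes the resulting threshold on $\beta$ and evaluates the logarithmic left-hand side there; the function names and sample values of `eps`, `F2`, `lam`, `d` are hypothetical.

```python
import math


def beta_threshold(eps, F2, lam, d):
    """Smallest beta with beta^(1/2 - eps) >= eps^-1 * F2^-2 * lam^(-9 - eps) * d^(-1 - eps)."""
    if not 0.0 < eps < 0.5:
        raise ValueError("eps must lie in (0, 1/2)")
    rhs = (1.0 / eps) * F2 ** -2 * lam ** (-9.0 - eps) * d ** (-1.0 - eps)
    return rhs ** (1.0 / (0.5 - eps))


def log_condition_lhs(beta, lam, d):
    """Left-hand side beta^(-1/2) * lam^-9 * d^-1 * log(beta / (d * lam))."""
    return beta ** -0.5 * lam ** -9 * d ** -1 * math.log(beta / (d * lam))


if __name__ == "__main__":
    eps, F2, lam, d = 0.1, 1.0, 0.5, 10
    beta = beta_threshold(eps, F2, lam, d)
    # The logarithmic condition should hold at (and beyond) the threshold.
    print(beta, log_condition_lhs(beta, lam, d), F2 ** 2)
```

The threshold is polynomial in $\lambda^{-1}$ and $d$ for any fixed $\varepsilon \in (0, 1/2)$, which is the form of requirement on $\beta$ used in the surrounding argument.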

Formula formula_269: * = J ′ λ [η λ,β ] with constants D ′ 1 , D 2 , D 3 , D 4 = O(1), provided that β ≥ Ω(poly(λ -1 , d))

Formula formula_270: $W_1(\eta_{\lambda,\beta}, \delta_v)$

Formula formula_271: 1 , C 1 , C 3 , C 4 > 0 such that ∀r ∈ [-1, +1], c 1 ≤ g ′ (r) ≤ C 1 , g ′′ (r)(1 -r 2 ) 1/2 ≤ C 3 , g ′′′ (r)(1 -r 2 ) 3/2 ≤ C 4 ,

Formula formula_272: 1 , C 1 , C 3 , C 4 such that ∀η, J λ (η) -J λ (δ v ) ≥ λα g W 2 2 (η, δ v ). Proof. Since J λ is convex, J λ (η) -J λ (δ v ) ≥ S d J ′ λ [δ v ]d(η -δ v ) = -λ S d g(⟨v, w⟩)d(η -δ v )(w) = λ S d [g(1) -g(⟨v, w⟩)] dη(w). Now let U r = w ∈ S d ; dist S d (w, v) ≤ r

Formula formula_273: $\sup_\theta\left|\frac{d^3}{d\theta^3}g(\cos\theta)\right| \le C_4 + 3C_3 + C_1 =: 6M_{3,g}.$
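
The three constants in the bound above can be traced to the three terms of the third derivative. The following worked computation is a sketch, assuming that $C_1$, $C_3$, $C_4$ bound $|g'(r)|$, $|g''(r)|(1 - r^2)^{1/2}$ and $|g'''(r)|(1 - r^2)^{3/2}$ respectively, with $r = \cos\theta$ and hence $(1 - r^2)^{1/2} = |\sin\theta|$.

```latex
\begin{align*}
\frac{d}{d\theta}\,g(\cos\theta) &= -g'(\cos\theta)\sin\theta,\\
\frac{d^2}{d\theta^2}\,g(\cos\theta) &= g''(\cos\theta)\sin^2\theta - g'(\cos\theta)\cos\theta,\\
\frac{d^3}{d\theta^3}\,g(\cos\theta) &= -g'''(\cos\theta)\sin^3\theta
  + 3\,g''(\cos\theta)\sin\theta\cos\theta + g'(\cos\theta)\sin\theta,\\
\left|\frac{d^3}{d\theta^3}\,g(\cos\theta)\right|
  &\le \underbrace{|g'''(r)|(1 - r^2)^{3/2}}_{\le C_4}
  + 3\,\underbrace{|g''(r)|(1 - r^2)^{1/2}}_{\le C_3}\,|\cos\theta|
  + \underbrace{|g'(r)|}_{\le C_1}\,|\sin\theta|.
\end{align*}
```

At $\theta = 0$ the first derivative vanishes and the second derivative equals $-g'(1)$, which gives the low-order terms of the Taylor expansion of $g(\cos\theta)$ around $\theta = 0$.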

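Formula formula_274: $g(\cos\theta) = g(1) + 0 + \frac{1}{2}(0 - g'(1))\theta^2 + \frac{1}{6}(g\circ\cos)^{(3)}(u)\theta^3$ for some $u\in[0, r]$ $\le g(1) - \frac{1}{2}g'(1)\theta^2 + \frac{1}{6}\sup_{[0,r]}\left|(g\circ\cos)^{(3)}\right|\theta^3 \le g(1) - \frac{1}{2}g'(1)\theta^2 + M_{3,g}\theta^3 = g(1) - \left[\frac{1}{2}g'(1) - M_{3,g}\theta\right]\theta^2 \le g(1) - \frac{1}{4}g'(1)\theta^2.$ (F.7)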

Formula formula_275: $\forall w\in U_r,\ g(1) - g(\langle v, w\rangle) \ge \frac{1}{4}g'(1)\,\mathrm{dist}_{\mathbb{S}^d}(w, v)^2,$

Formula formula_276: $\int_{U_r}\left[g(1) - g(\langle v, w\rangle)\right]d\eta(w) \ge \frac{1}{4}g'(1)\int_{U_r}\mathrm{dist}_{\mathbb{S}^d}(w, v)^2\,d\eta(w).$

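Formula formula_277: $\int_{\mathbb{S}^d\setminus U_r}\left[g(1) - g(\langle v, w\rangle)\right]d\eta(w) \ge \left[g(1) - g(\cos(r))\right]\left[1 - \eta(U_r)\right] \ge \frac{1}{4}g'(1)r^2\left[1 - \eta(U_r)\right]$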

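Formula formula_278: $J_\lambda(\eta) - J_\lambda(\delta_v) \ge \lambda\left\{\frac{1}{4}g'(1)r^2\left[1 - \eta(U_r)\right] + \frac{g'(1)}{4}\int_{U_r}\mathrm{dist}_{\mathbb{S}^d}(w, v)^2\,d\eta(w)\right\} = \frac{\lambda g'(1)}{4}\left\{r^2\left[1 - \eta(U_r)\right] + \int_{U_r}\mathrm{dist}_{\mathbb{S}^d}(w, v)^2\,d\eta(w)\right\}.$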

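Formula formula_279: $W_2^2(\eta, \delta_v) = \int_{\mathbb{S}^d\setminus U_r}\mathrm{dist}_{\mathbb{S}^d}(v, w)^2\,d\eta(w) + \int_{U_r}\mathrm{dist}_{\mathbb{S}^d}(v, w)^2\,d\eta(w) \le \pi^2\left[1 - \eta(U_r)\right] + \int_{U_r}\mathrm{dist}_{\mathbb{S}^d}(v, w)^2\,d\eta(w)$. Hence $J_\lambda(\eta) - J_\lambda(\delta_v) \ge \frac{\lambda g'(1)}{4}\cdot\sup_{0\le r\le\frac{g'(1)}{2M_{3,g}}}\min\left\{\frac{r^2}{\pi^2}, 1\right\}W_2^2(\eta, \delta_v) = \lambda\cdot\frac{g'(1)}{4}\min\left\{\left(\frac{g'(1)}{2M_{3,g}}\right)^2/\pi^2,\ 1\right\}\cdot W_2^2(\eta, \delta_v) \ge \lambda\cdot\frac{c_1}{4}\min\left\{\left(\frac{c_1}{2M_{3,g}}\right)^2/\pi^2,\ 1\right\}\cdot W_2^2(\eta, \delta_v) =: \lambda\alpha_gW_2^2(\eta, \delta_v).$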

Formula formula_280: -d ≲ C -1 ≲ 1/ √ d.

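Formula formula_281: $S_\epsilon = \left\{w\in\mathbb{S}^d : \mathrm{dist}_{\mathbb{S}^d}(w, v) \le \epsilon\right\}$. There exist universal constants $C_-, C_+ > 0$ such that $\forall\, 0 < \epsilon \le \frac{\pi}{4},\ C_-^{-1}(\epsilon/2)^d \le \tau(S_\epsilon) \le C_+\epsilon^d/\sqrt{d}.$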

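Formula formula_282: $Z = \int_{-1}^1(1 - z^2)^{d/2-1}\,dz = B\!\left(\frac{d}{2}, \frac{1}{2}\right) = \frac{\Gamma\!\left(\frac{d}{2}\right)\sqrt{\pi}}{\Gamma\!\left(\frac{d+1}{2}\right)}.$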

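Formula formula_283: $x^{1-s} < \frac{\Gamma(x+1)}{\Gamma(x+s)} < (x+1)^{1-s}$ applied to $s = \frac{1}{2}$ and $x = \frac{d-1}{2}$, we have $\sqrt{\frac{d-1}{2}} < \frac{\Gamma\left(\frac{d+1}{2}\right)}{\Gamma\left(\frac{d}{2}\right)} < \sqrt{\frac{d+1}{2}}$, so $\sqrt{\frac{2\pi}{d+1}} \le Z \le \sqrt{\frac{2\pi}{d-1}}$. By definition, since $\mathrm{dist}_{\mathbb{S}^d}(w, v) = \arccos(\langle w, v\rangle)$, $\tau(S_\epsilon) = \int_{\cos(\epsilon)}^1h(z)\,dz$. One can verify $\forall\, 0 < \epsilon \le \frac{\pi}{4},\ \sqrt{1 - \epsilon^2} \le \cos(\epsilon) \le \sqrt{1 - \frac{\epsilon^2}{4}}$. So for all $0 < \epsilon \le \frac{\pi}{4}$, $\tau(S_\epsilon) = \int_{\cos(\epsilon)}^1h(z)\,dz \le \int_{\sqrt{1-\epsilon^2}}^1h(z)\,dz = Z^{-1}\int_{\sqrt{1-\epsilon^2}}^1(1 - z^2)^{d/2-1}\,dz = Z^{-1}\int_{1-\epsilon^2}^1(1 - t)^{d/2-1}\frac{dt}{2\sqrt{t}} \le Z^{-1}\frac{1}{2\sqrt{1-\epsilon^2}}\int_{1-\epsilon^2}^1(1 - t)^{d/2-1}\,dt = Z^{-1}\frac{1}{2\sqrt{1-\epsilon^2}}\int_0^{\epsilon^2}t^{d/2-1}\,dt = Z^{-1}\frac{1}{2\sqrt{1-\epsilon^2}}\cdot\frac{2}{d}\left[\epsilon^2\right]^{d/2} \le Z^{-1}\frac{1}{d\sqrt{1 - (\pi/4)^2}}\epsilon^d \le C_+\epsilon^d/\sqrt{d}$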
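
The normalizing constant and its two-sided bound above lend themselves to a quick numerical check. The sketch below is illustrative only: the function names, the midpoint-rule quadrature, and the tested dimensions are assumptions, and the bounds are only evaluated for $d \ge 2$ (where $x = \frac{d-1}{2} > 0$, as Gautschi's inequality requires).

```python
import math


def sphere_normalizer(d):
    """Z = Gamma(d/2) * sqrt(pi) / Gamma((d+1)/2), computed via log-gamma for stability."""
    return math.sqrt(math.pi) * math.exp(math.lgamma(d / 2.0) - math.lgamma((d + 1) / 2.0))


def sphere_normalizer_quadrature(d, n=20000):
    """Midpoint-rule approximation of the integral of (1 - z^2)^(d/2 - 1) over [-1, 1]."""
    h = 2.0 / n
    return h * sum(
        (1.0 - (-1.0 + (i + 0.5) * h) ** 2) ** (d / 2.0 - 1.0) for i in range(n)
    )


def gautschi_bounds(d):
    """Lower and upper bounds sqrt(2*pi/(d+1)) and sqrt(2*pi/(d-1)) on Z, for d >= 2."""
    return math.sqrt(2.0 * math.pi / (d + 1)), math.sqrt(2.0 * math.pi / (d - 1))


def cosine_sandwich_holds(eps):
    """Check sqrt(1 - eps^2) <= cos(eps) <= sqrt(1 - eps^2 / 4) for 0 < eps <= pi/4."""
    return math.sqrt(1.0 - eps * eps) <= math.cos(eps) <= math.sqrt(1.0 - eps * eps / 4.0)


if __name__ == "__main__":
    for d in (2, 3, 4, 7):
        lower, upper = gautschi_bounds(d)
        print(d, lower, sphere_normalizer(d), sphere_normalizer_quadrature(d), upper)
```

Both bounds scale like $\sqrt{2\pi/d}$, which is the source of the $1/\sqrt{d}$ factor in the estimate of $\tau(S_\epsilon)$.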

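Formula formula_284: $\tau(S_\epsilon) \ge \int_{\sqrt{1-\epsilon^2/4}}^1h(z)\,dz = Z^{-1}\int_{\sqrt{1-\epsilon^2/4}}^1(1 - z^2)^{d/2-1}\,dz = Z^{-1}\int_{1-\epsilon^2/4}^1(1 - t)^{d/2-1}\frac{dt}{2\sqrt{t}} \ge Z^{-1}\frac{1}{2}\int_{1-\epsilon^2/4}^1(1 - t)^{d/2-1}\,dt = Z^{-1}\frac{1}{2}\int_0^{\epsilon^2/4}t^{d/2-1}\,dt = Z^{-1}\frac{1}{2}\cdot\frac{2}{d}\left[\epsilon^2/4\right]^{d/2} = Z^{-1}\frac{1}{d}(\epsilon/2)^d \ge c(\epsilon/2)^d/\sqrt{d}.$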

Formula formula_285: ) d / √ d ≥ C -1 -(ϵ/2) d for some universal constants c ′ , C -.

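Formula formula_286: $\sup_w\left\|\nabla^i\phi(w)\right\|_{\mathcal{H}} \le B_i < \infty$ for $i\in\{0, 1\}$, and if $\beta \ge \frac{4d\lambda}{\pi}\left(B_0B_1\|y\|_{\mathcal{H}}^2\right)^{-1}$, then $W_2(\eta_{\lambda,\beta}, \delta_v) \le \sqrt{\frac{1}{\alpha_g}\beta^{-1}\frac{d}{\lambda}\left(C + \log\left(B_0B_1\|y\|_{\mathcal{H}}^2\right) - \log(\beta^{-1}d\lambda)\right)}$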

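Formula formula_287: $J_\lambda(\eta_{\lambda,\beta}) \le J_\lambda(\eta_{\lambda,\beta}) + \beta^{-1}H(\eta_{\lambda,\beta}\,|\,\tau) = J_{\lambda,\beta}(\eta_{\lambda,\beta}) \le J_{\lambda,\beta}(\eta_\sigma) = J_\lambda(\eta_\sigma) + \beta^{-1}H(\eta_\sigma\,|\,\tau).$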

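Formula formula_288: $J_\lambda(\eta) - J_\lambda(\delta_v) \ge \lambda\alpha_g\cdot W_2^2(\eta, \delta_v)$, so $\lambda\alpha_g\cdot W_2^2(\eta_{\lambda,\beta}, \delta_v) \le J_\lambda(\eta_{\lambda,\beta}) - J_\lambda(\delta_v) \le J_\lambda(\eta_\sigma) - J_\lambda(\delta_v) + \beta^{-1}H(\eta_\sigma\,|\,\tau).$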

Formula formula_289: $J_\lambda(\eta_\sigma) - J_\lambda(\delta_v) \le \frac{B_0B_1\|y\|_{\mathcal{H}}^2}{\lambda}\cdot W_1(\eta_\sigma, \delta_v)$

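Formula formula_290: $W_1(\eta_\sigma, \delta_v) = \int\mathrm{dist}_{\mathbb{S}^d}(w, v)\,d\eta_\sigma(w) = \frac{1}{\mathrm{vol}(S_\sigma)}\int_{S_\sigma}\mathrm{dist}_{\mathbb{S}^d}(w, v)\,d\mathrm{vol}(w) \le \sigma.$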

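Formula formula_291: $H(\eta_\sigma\,|\,\tau) = \int d\eta_\sigma\log\frac{d\eta_\sigma}{d\tau} = \log\frac{\mathrm{vol}(\mathbb{S}^d)}{\mathrm{vol}(S_\sigma)} = -\log\tau(S_\sigma) \le \log C - d\log\frac{\sigma}{2}$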

Formula formula_292: $J_\lambda(\eta_\sigma) - J_\lambda(\delta_v) + \beta^{-1}H(\eta_\sigma\,|\,\tau) \le \frac{B_0B_1\|y\|_{\mathcal{H}}^2}{\lambda}\sigma - \beta^{-1}d\log\sigma + \beta^{-1}d\log 2C.$

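Formula formula_293: $\lambda\alpha_g\cdot W_2^2(\eta_{\lambda,\beta}, \delta_v) \le \inf_{0<\sigma\le\frac{\pi}{4}}\left[\frac{B_0B_1\|y\|_{\mathcal{H}}^2}{\lambda}\sigma - \beta^{-1}d\log\sigma\right] + \beta^{-1}d\log 2C = \beta^{-1}d - \beta^{-1}d\log\frac{\beta^{-1}d\lambda}{B_0B_1\|y\|_{\mathcal{H}}^2} + \beta^{-1}d\log 2C = \beta^{-1}d\left[1 + \log(2C) - \log(\beta^{-1}d\lambda) + \log\left(B_0B_1\|y\|_{\mathcal{H}}^2\right)\right]$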
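
The infimum above is an instance of minimizing $\sigma\mapsto a\sigma - b\log\sigma$ over $\sigma > 0$, whose minimizer is $\sigma^* = b/a$ with value $b - b\log(b/a)$; here $a = B_0B_1\|y\|_{\mathcal{H}}^2/\lambda$ and $b = \beta^{-1}d$, and $\sigma^* \le \pi/4$ corresponds to the stated lower bound on $\beta$. The following sketch checks the closed form against a grid search; the function names and the values of `a` and `b` are hypothetical.

```python
import math


def entropy_tradeoff(a, b, sigma):
    """Objective a * sigma - b * log(sigma), for sigma > 0."""
    return a * sigma - b * math.log(sigma)


def entropy_tradeoff_minimizer(a, b):
    """Return (sigma_star, minimal value) of sigma -> a * sigma - b * log(sigma), a, b > 0."""
    sigma_star = b / a
    return sigma_star, b - b * math.log(sigma_star)


if __name__ == "__main__":
    a, b = 3.0, 0.5  # illustrative stand-ins for B0*B1*|y|^2/lambda and d/beta
    sigma_star, value = entropy_tradeoff_minimizer(a, b)
    grid_min = min(entropy_tradeoff(a, b, i / 10000.0) for i in range(1, 20000))
    print(sigma_star, value, grid_min)
```

The objective is strictly convex in $\sigma$, so the stationary point $a = b/\sigma$ is the unique minimizer.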

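Formula formula_294: $W_2(\eta_{\lambda,\beta}, \delta_v) \le \sqrt{\frac{1}{\lambda\alpha_g}\beta^{-1}d\left[1 + \log(2C) - \log(\beta^{-1}d\lambda) + \log\left(B_0B_1\|y\|_{\mathcal{H}}^2\right)\right]}$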

Formula formula_295: $P_{k,d}(t) = (-1)^k\frac{\Gamma(d/2)}{2^k\,\Gamma(k + d/2)}(1 - t^2)^{(2-d)/2}\left(\frac{d}{dt}\right)^k(1 - t^2)^{k+(d-2)/2}.$

Formula formula_296: $\sum_{j=1}^{N(d,k)}Y_{kj}(w)Y_{kj}(v) = N(d, k)P_{k,d}(\langle w, v\rangle),\ \forall w, v\in\mathbb{S}^d$. • (Hecke-Funk Formula) Suppose $\phi\in L^2(\tau)$ is given by $\phi(\cdot) = \varphi(\langle w, \cdot\rangle)$ for some $w\in\mathbb{S}^d$. Then [AH12, Theorem 2.22], $\langle\phi, Y_{kj}\rangle_{L^2(\tau)} = \frac{\Gamma((d+1)/2)}{\Gamma(d/2)\sqrt{\pi}}\,Y_{kj}(w)\int_{-1}^1\varphi(t)P_k(t)(1 - t^2)^{(d-2)/2}\,dt.$

Formula formula_297: $\left\langle P_{k,d}(\langle w, \cdot\rangle), P_{k',d}(\langle v, \cdot\rangle)\right\rangle_{L^2(\tau)} = \delta_{kk'}\frac{P_{k,d}(\langle w, v\rangle)}{N(d, k)}.$

Formula formula_299: q ′ (⟨w, v⟩) = 1 d + 1 E ∥x∥ 2 φ ′ (⟨w, x⟩)φ ′ (⟨v, x⟩) = 1 d + 1 E ∥x∥ 2 E [φ ′ (

Formula formula_300: (λ + ∥φ∥ 2 L 2 (ρ) ) 2 ≤ g ′ ≤ b 2 ∥φ∥ 2 L 2 (ρ) ∥φ ′ ∥ 2 L 4 (ρ) (λ + ∥φ∥ 2 L 2 (ρ)

