['1c1', '< Title: Mean-Field Langevin Dynamics for Signed Measures via a Bilevel Approach', '---', '> Title: Mean-Field Langevin Dynamics for Signed Measures: Enhanced Convergence via Bilevel Optimization', '3c3', '< Abstract: Mean-field Langevin dynamics (MLFD) is a class of interacting particle methods that tackle convex optimization over probability measures on a manifold, which are scalable, versatile, and enjoy computational guarantees. However, some important problems -such as risk minimization for infinite width two-layer neural networks, or sparse deconvolution -are originally defined over the set of signed, rather than probability, measures. In this paper, we investigate how to extend the MFLD framework to convex optimization problems over signed measures. Among two known reductions from signed to probability measures -the lifting and the bilevel approaches -we show that the bilevel reduction leads to stronger guarantees and faster rates (at the price of a higher per-iteration complexity). In particular, we investigate the convergence rate of MFLD applied to the bilevel reduction in the low-noise regime and obtain two results. First, this dynamics is amenable to an annealing schedule, adapted from [SWON23], that results in improved convergence rates to a fixed multiplicative accuracy. Second, we investigate the problem of learning a single neuron with the bilevel approach and obtain local exponential convergence rates that depend polynomially on the dimension and noise level (to compare with the exponential dependence that would result from prior analyses). * Equal contributions, authors ordered randomly. 1 The square exponent on ∥ • ∥T V might appear unusual, but it is convenient for our subsequent developments. We show in App. A that the regularization path is the same with or without the square.', '---', '> Abstract: Mean-field Langevin dynamics (MFLD) provides a scalable and robust framework for convex optimization over probability measures. 
However, an important class of problems, including risk minimization in infinite-width two-layer neural networks and sparse deconvolution, is originally formulated over signed measures. This paper addresses the challenge of extending MFLD to these convex optimization problems on signed measures. We compare two reduction strategies from signed to probability measures: the lifting and bilevel approaches. Our analysis shows that the bilevel reduction yields stronger theoretical guarantees and faster convergence rates, at the price of a higher per-iteration complexity. Our contributions, which concern the low-noise regime, are twofold. First, we show that MFLD applied to the bilevel reduction is amenable to an annealing schedule, adapted from [SWON23], that improves convergence rates to a fixed multiplicative accuracy. Second, for the specific problem of learning a single neuron, MFLD applied to the bilevel reduction achieves local exponential convergence rates that depend polynomially on the dimension and noise level, whereas prior analyses would give an exponential dependence. These results support the bilevel approach as a theoretically grounded way of extending MFLD to signed measures.', '6c6', '< Let M(W) be the set of finite signed measures on a compact Riemannian manifold without boundaries W and let G : M(W) → R be a convex function, assumed smooth in the sense of Assumption 1 below. In this paper, we investigate optimization methods to solve', '---', '> Many problems in machine learning and optimization are naturally formulated as convex optimization over the space M(W) of finite signed measures on a compact Riemannian manifold without boundary W. These problems consist in minimizing a convex function G : M(W) → R, assumed smooth in the sense of Assumption 1 below, regularized by the total variation norm, and take the form:', '8,9c8', '< where ∥ • ∥ T V is the total variation norm and λ > 0 the regularization level. 
1 This covers for instance risk minimization for infinite-width 2-layer neural networks (2NN) [BRVDM05;Bac17] by taking W = S d the unit sphere in R d+1 or W = R d+1 and G(ν) = E (x,y)∼ρ ℓ(h(ν, x), y) where h(ν, x) = W φ(⟨x, w⟩)dν(w). (1.2)', '< Here φ : R → R is the activation function, h(ν, •) is the predictor parameterized by ν, G is the (population or empirical) risk under the data distribution ρ ∈ P(R d+1 ×R), and ℓ is smooth (uniformly in y) and convex in its first argument. These 2NNs will be our guiding examples throughout, but note that the class of problems covered by Eq. (1.1) is more general and includes for instance sparse deconvolution via the Beurling-LASSO estimator [DG12] or optimal design [MZ04].', '---', "> where ∥·∥_TV denotes the total variation norm and λ > 0 is the regularization parameter. 1 This formulation covers, for instance, risk minimization for infinite-width two-layer neural networks (2NNs) [BRVDM05;Bac17], obtained by taking W = S^d (the unit sphere in R^{d+1}) or W = R^{d+1}, as well as sparse deconvolution via the Beurling-LASSO estimator [DG12] and optimal design [MZ04]. In the 2NN case, G(ν) = E_{(x,y)∼ρ} ℓ(h(ν, x), y) is the (population or empirical) risk under the data distribution ρ, the predictor is h(ν, x) = ∫_W φ(⟨x, w⟩) dν(w), φ : R → R is the activation function, and ℓ is smooth (uniformly in y) and convex in its first argument. This paper investigates optimization methods for problems of this form.", '15,19c14,20', '< At first, it is not obvious that MFLD can be applied at all since it is originally defined only for problems over probability measures. However, we can find in the literature two general recipes to reduce a problem over M(W) to a problem over P(W ′ ), thus amenable to MFLD. The first one is a lifting reduction, that takes W ′ = R × W where the extra dimension serves to encode the signed mass of particles [CB18, Section A.2] [Chi22c]. 
The second one, that takes W ′ = W, is a bilevel reduction [Bac21; TS24] that uses a variational representation of the regularizer ∥ • ∥ 2 T V , common in the multiple kernel learning literature [LCBGJ04]. A first task is thus to compare the behavior of MFLD on these two approaches. Furthermore, MFLD involves an entropic regularization which is absent from Eq. (1.1). A second task is thus to analyze the behavior of MFLD in the large β regime, when the regularization vanishes.', '< In this work, we tackle these two tasks and make the following contributions:', '< • In Sec. 3, we introduce the lifting and bilevel reductions and compare the "displacement smoothness" (P1) and "uniform LSI" (P2) properties of the resulting problems. These properties play a central role in the global convergence analysis of MFLD. Specifically, we consider a large class of lifting reductions and show that none satisfies simultaneously (P1) and (P2) unless λ is large. In contrast, the bilevel reduction satisfies both under mild assumptions. So in the sequel we focus on MFLD applied to the bilevel reduction. • In Sec. 4, we investigate what convergence rates can be obtained for the problem (1.1) by using MFLD on the bilevel formulation. While a classical simulated annealing technique yields convergence in O(log log t/ log t), we show that the structure of the bilevel objective is in fact amenable to a more efficient annealing schedule, adapted from [SWON23], that reaches a fixed multiplicative accuracy, say 1.01 inf G λ , in time e O(λ -1 log λ -1 ) instead of e O(λ -2 ) for the classical schedule.', '< • In Sec. 5, to obtain a more complete picture, we investigate the problem of learning a single neuron. 
Here, using a Lyapunov type argument, we show that the local convergence rate of MFLD applied to the bilevel formulation scales polynomially in β and d, at odds with all previous MFLD analyses which had exponential dependencies.', '< All proofs are deferred to the Appendix.', '---', '> Two strategies from the literature reduce a problem over signed measures M(W) to a problem over probability measures P(W′), for which MFLD is defined. The first is a lifting reduction [CB18, Section A.2; Chi22c], which takes W′ = R × W, the extra dimension encoding the signed mass of particles. The second is a bilevel reduction [Bac21; TS24], which takes W′ = W and uses a variational representation of the squared total variation norm ∥·∥_TV^2, common in the multiple kernel learning literature [LCBGJ04]. A first task is to compare the behavior of MFLD under these two reductions. A second task arises because MFLD involves an entropic regularization that is absent from Eq. (1.1): we need to analyze its behavior in the low-noise (large β) regime, where this regularization vanishes.', '> ', '> Our work addresses these two tasks with the following contributions:', '> • In Section 3, we introduce the lifting and bilevel reductions and compare the "displacement smoothness" (P1) and "uniform LSI" (P2) properties of the resulting problems, which play a central role in the global convergence analysis of MFLD. We show that, within a large class of lifting reductions, none satisfies (P1) and (P2) simultaneously unless λ is large. In contrast, the bilevel reduction satisfies both properties under mild assumptions. Our subsequent analysis therefore focuses on MFLD applied to the bilevel reduction (MFLD-Bilevel).', '> • In Section 4, we investigate the convergence rates that can be obtained for problem (1.1) using MFLD-Bilevel. 
While a classical simulated annealing technique yields convergence in O(log log t / log t), we show that the structure of the bilevel objective is amenable to a more efficient annealing schedule, adapted from [SWON23]. This schedule reaches a fixed multiplicative accuracy (e.g., 1.01 inf G_λ) in time e^{O(λ^{-1} log λ^{-1})}, instead of e^{O(λ^{-2})} for the classical schedule.', '> • In Section 5, to obtain a more complete picture, we analyze the problem of learning a single neuron. Using a Lyapunov-type argument, we show that the local exponential convergence rate of MFLD-Bilevel scales polynomially in the dimension d and in β, in contrast with previous MFLD analyses, which had exponential dependencies.', '> All proofs are provided in the Appendix.', '1682d1682', '< ']
