Diffusion-based Annealed Boltzmann Generators: benefits, pitfalls and hopes

TMLR Paper7222 Authors

28 Jan 2026 (modified: 27 May 2026) | Under review for TMLR | CC BY 4.0
Abstract: Sampling configurations at thermodynamic equilibrium is a central challenge in statistical physics. Boltzmann Generators (BGs) address this problem by pairing a generative model with a Monte Carlo (MC) correction scheme, yielding asymptotically consistent samples from an unnormalized target density. However, most existing BGs rely on classic MC mechanisms such as importance sampling, which (i) impose strong constraints on the backbone model (typically requiring exact and efficient likelihood evaluation) and (ii) suffer from severe scalability issues in high-dimensional, multi-modal settings. This work investigates BGs built around annealed Monte Carlo (aMC) schemes, which mitigate the limitations of classic MC by bridging a simple reference distribution to the target through a sequence of intermediate densities. In this context, diffusion models (DMs) are particularly appealing backbones: they are powerful generative models and naturally induce density paths that have been leveraged in prior aMC-based methods. We provide an empirical meta-analysis of this DM-based aMC-BG design choice on controlled yet challenging synthetic benchmarks based on multi-modal Gaussian mixtures, varying inter-mode separation, number of modes, and dimensionality. To disentangle learning effects from inference effects, we first study an idealized setting in which the DM is perfectly learned, and then turn to realistic settings where the DM is trained from data. Even in the idealized regime, we find that standard aMC integrations of DMs that rely only on first-order stochastic denoising kernels systematically fail in the proposed scenarios. In contrast, incorporating second-order denoising kernels can substantially improve performance when the required covariance information is available. 
Motivated by this gap, we propose an alternative aMC integration based on deterministic first-order transport maps derived from DMs; empirically, this approach consistently outperforms its stochastic first-order counterpart, albeit at increased computational cost. Overall, while results in the perfect-learning regime suggest that exploiting DM-induced dynamics within aMC is a promising route to building effective BGs, our experiments with learned DMs show that DM–aMC combinations still struggle to produce accurate BGs in practice. We attribute this limitation primarily to inaccuracies in DM log-density estimation.
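The classic importance-sampling correction that the abstract contrasts with annealed schemes can be sketched on a toy problem. The snippet below is only an illustration under stated assumptions, not the paper's method: the 1D two-mode Gaussian mixture, the wide Gaussian proposal standing in for a generative model with tractable likelihood, and the function names are all hypothetical. It estimates the log-normalizing constant (which is 0 for a normalized target) and the effective sample size from the importance weights.

```python
import math
import random


def log_normal_pdf(x, mean, std):
    """Log-density of a 1D Gaussian N(mean, std^2)."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def log_target(x):
    """Hypothetical normalized two-mode mixture, so the true log Z is 0."""
    a = math.log(0.5) + log_normal_pdf(x, -3.0, 1.0)
    b = math.log(0.5) + log_normal_pdf(x, 3.0, 1.0)
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))  # stable log-sum-exp


def importance_sampling_diagnostics(n_samples=20000, proposal_std=5.0, seed=0):
    """Return (log Z estimate, effective sample size) for a Gaussian proposal.

    The proposal N(0, proposal_std^2) is an assumption standing in for a
    generative model whose likelihood can be evaluated exactly.
    """
    rng = random.Random(seed)
    log_w = []
    for _ in range(n_samples):
        x = rng.gauss(0.0, proposal_std)
        # Log importance weight: log target minus log proposal.
        log_w.append(log_target(x) - log_normal_pdf(x, 0.0, proposal_std))
    m = max(log_w)
    sum_w = sum(math.exp(lw - m) for lw in log_w)
    sum_w2 = sum(math.exp(2.0 * (lw - m)) for lw in log_w)
    log_z_hat = m + math.log(sum_w) - math.log(n_samples)
    ess = sum_w ** 2 / sum_w2  # Kish effective sample size
    return log_z_hat, ess


log_z_hat, ess = importance_sampling_diagnostics()
print(f"log Z estimate: {log_z_hat:.4f}, ESS: {ess:.0f}")
```

In one dimension with a well-matched proposal the estimate stays close to 0; the scalability issues mentioned in the abstract arise because the weight variance typically grows quickly with dimension and mode separation.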
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission: All changes are displayed in blue in the revised manuscript.

**Streamlined Experimental Protocol.** We reduced the number of experimental configurations to 3 representative cases per target family (*TwoModes* and *ManyModes*), enabling more ablations within the same computational budget. The revised manuscript is 10 pages shorter despite containing more content.

**Improved First-Order Kernels.** We now default to the DDPM denoising kernel $q_{s|t}(\cdot|x_t) = \mathcal{N}(m_{s|t}(x_t), \sigma^2_{t|s} I_d)$ in place of the EI scheme used in prior work; empirically, the DDPM kernel outperforms the EI scheme across all annealed Monte Carlo (aMC) variants.

**$\Lambda$-Optimal Annealing Schedule.** Following a reviewer recommendation, we precompute the $\Lambda$-optimal annealing schedule for all targets by computing the local communication barriers $\lambda(t)$ and inverting the cumulative profile $\Lambda(t) = \int_0^t \lambda(u)\, du$. This gives tempering its most favorable setting, yet diffusion paths remain decisively superior.

**Log-Normalizing Constant as Additional Metric.** We now report $\log \mathcal{Z}$ (equal to $0$ in our standardized setting) across all experiments as a complementary diagnostic. It consistently agrees with the Sliced Wasserstein distance and provides particularly clear quantitative evidence of the superiority of diffusion paths over tempering.

**Theoretical Guarantees for Deterministic Transitions.** This is the most substantial change. We now use the lower-variance Hutch++ estimator in place of the standard Hutchinson estimator, and embed all deterministic aMC methods within the penalty correction of Ceperley & Dewing (1999). Concretely, the noisy log-Jacobian estimate $\hat{\ell}(x)$ of $\log|\det J_T(x)|$ is corrected by subtracting the penalty term $$u_B(\hat{V}(x), N) = \frac{\hat{V}(x)}{N} + \frac{\hat{V}(x)^2}{4(N+1)} + \cdots$$ This correction provably renders the AIS/SMC importance weights unbiased and consistent, and guarantees that deterministic RE admits the correct invariant distribution $\bar{\pi}$. Extensive ablations over $n_\text{iter}$, $n_\text{trunc}$, and $n_\text{hutch}$ confirm negligible residual error in practice.

**Semi-Ideal Experiment.** We added a semi-ideal condition pairing exact diffusion-path densities $p_{t}$ with learned scores $s^\theta_t \approx \nabla \log p_t$. Performance recovers to near-ideal levels, directly isolating log-density estimation as the primary bottleneck in the realistic setting.

**Additional Objectives and Architecture Search.** We added the DiffCLF objective (OuYang et al., 2026), extended the hyperparameter search, and trained in continuous time, yielding only marginal improvements over the original realistic-setting results.
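The schedule construction described under the $\Lambda$-optimal annealing item (compute $\lambda(t)$, accumulate $\Lambda(t)$, then invert) can be sketched numerically. This is a minimal illustration, assuming the local barrier is available as a positive function on $[0, 1]$ and that the schedule places levels at equally spaced values of $\Lambda$; the function names, the trapezoid quadrature, the grid size, and the example barrier are hypothetical choices and are not taken from the manuscript.

```python
import bisect
import math


def lambda_optimal_schedule(local_barrier, n_levels, n_grid=2001):
    """Place n_levels + 1 annealing times so each step spans equal Lambda mass.

    local_barrier: callable t -> lambda(t) > 0 on [0, 1] (assumed given).
    """
    grid = [i / (n_grid - 1) for i in range(n_grid)]
    values = [local_barrier(t) for t in grid]

    # Cumulative profile Lambda(t) = int_0^t lambda(u) du via the trapezoid rule.
    cumulative = [0.0]
    for i in range(1, n_grid):
        dt = grid[i] - grid[i - 1]
        cumulative.append(cumulative[-1] + 0.5 * (values[i] + values[i - 1]) * dt)
    total = cumulative[-1]

    # Invert Lambda by linear interpolation at equally spaced target values.
    schedule = []
    for k in range(n_levels + 1):
        target = total * k / n_levels
        j = bisect.bisect_left(cumulative, target)
        if j == 0:
            schedule.append(grid[0])
        elif j >= n_grid:
            schedule.append(grid[-1])
        else:
            lo, hi = cumulative[j - 1], cumulative[j]
            frac = 0.0 if hi == lo else (target - lo) / (hi - lo)
            schedule.append(grid[j - 1] + frac * (grid[j] - grid[j - 1]))
    return schedule


def example_barrier(t):
    """Hypothetical barrier profile with a sharp peak at t = 0.5."""
    return 1.0 + 9.0 * math.exp(-((t - 0.5) / 0.1) ** 2)


schedule = lambda_optimal_schedule(example_barrier, n_levels=10)
print([round(t, 3) for t in schedule])
```

With this example barrier the resulting levels cluster around $t = 0.5$, where the barrier is largest, and spread out near the endpoints.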
Assigned Action Editor: ~Julius_Berner1
Submission Number: 7222