Revisiting Contrastive Divergence for Density Estimation and Sample Generation

TMLR Paper 5150 Authors

18 Jun 2025 (modified: 02 Sept 2025) · Under review for TMLR · CC BY 4.0
Abstract: Energy-based models (EBMs) have recently attracted renewed attention as models of complex data distributions, such as natural images. Improved image generation under the maximum-likelihood (MLE) objective has been achieved by combining highly expressive energy functions, in the form of deep neural networks, with Langevin dynamics for sampling from the model. However, Nijkamp and colleagues have recently shown that such EBMs become good generators without becoming good density estimators: an impractical number of Langevin steps is typically required to exit the burn-in phase of the Markov chain, so training merely sculpts the energy landscape near the distribution used to initialize the chain. Careful hyperparameter choices and the use of persistent chains can significantly shorten the required number of Langevin steps, but at the price that new samples can be generated only in the vicinity of the persistent chain, not from noise. Here we introduce a simple method that achieves both convergence of the Markov chain in a practical number of Langevin steps (L = 500) and the ability to generate diverse, high-quality samples from noise. Under the hypothesis that Hinton’s classic contrastive-divergence (CD) training does yield good density estimators but simply lacks a mechanism for connecting the noise manifold to the learned data manifold, we combine CD with an MLE-like loss. We demonstrate that a simple ConvNet trained with this method performs well at both generation and density estimation on CIFAR-10, Oxford Flowers, and a synthetic dataset for which the learned density can be verified visually.
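The abstract does not spell out the training objective, so the sketch below is only a guess at the shape of the method, not the authors' implementation: unadjusted Langevin dynamics run for a fixed number of steps, combined with a loss that mixes a CD-style term (negative samples initialized at the data) with an MLE-like term (negative samples initialized from noise). The names `langevin_sample` and `hybrid_loss`, the weight `lam`, the step size, the noise scale, and the uniform noise initialization are all illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of Langevin sampling from an EBM and a
# hybrid CD + MLE-like loss. `energy` is assumed to map a batch of inputs to
# per-example scalar energies E_theta(x).
import torch

def langevin_sample(energy, x_init, n_steps=500, step_size=1e-2, noise_std=5e-3):
    """Run n_steps of (unadjusted) Langevin dynamics, starting from x_init."""
    x = x_init.clone().detach().requires_grad_(True)
    for _ in range(n_steps):
        grad = torch.autograd.grad(energy(x).sum(), x)[0]
        x = x - step_size * grad + noise_std * torch.randn_like(x)
        x = x.detach().requires_grad_(True)
    return x.detach()

def hybrid_loss(energy, x_data, lam=0.5, n_steps=500):
    """CD-style term from data-initialized chains plus an MLE-like term from
    noise-initialized chains, weighted by the (assumed) hyperparameter lam."""
    x_cd = langevin_sample(energy, x_data, n_steps)                      # negatives started at the data
    x_noise = langevin_sample(energy, torch.rand_like(x_data), n_steps)  # negatives started from noise
    e_pos = energy(x_data).mean()
    cd_term = e_pos - energy(x_cd).mean()
    mle_term = e_pos - energy(x_noise).mean()
    return cd_term + lam * mle_term
```

In this reading, the weight on the noise-initialized term controls how strongly chains started from noise are pulled onto the learned data manifold, which is the mechanism the abstract argues plain CD lacks.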
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: In response to reviewer comments, we have:
- added a set of experiments probing sample diversity (Section 4.2.4);
- added FID results for long-run chains, which help confirm the visual intuition that hybrid-trained models have converged in the short run;
- added energy-distribution plots for Oxford Flowers, analogous to their counterparts for CIFAR-10 in Fig. 5 (see the reply to rfCw for a précis);
- added Algorithm 1 (pseudo-code);
- added FID scores for other values of $\lambda$;
- added a section on sample diversity (including what is now Fig. 7);
- added a fuller description of OOD detection to the Methods;
- provided a longer discussion of the optimal value of $\lambda$;
- removed $T$ from the equation for LD (Eq. 3), and provided the more general version with $T$ in the supplementary material;
- improved Fig. 1 (schematic);
- improved notation to distinguish more clearly between Langevin samples and elements of a training batch;
- made other small edits requested by reviewers.
We also plan to add the results of two more experiments, but are submitting this material now so that the reviewers can see and review the changes made to this point.
Assigned Action Editor: ~Yingnian_Wu1
Submission Number: 5150