Abstract: Gaussian smoothing and probabilistic denoising via the empirical Bayes formalism, i.e., the Tweedie-Miyasawa formula (TMF), are the two key ingredients in the success of score-based generative models in Euclidean spaces. Smoothing eases the problem of learning and sampling in high dimensions, denoising recovers the original signal, and the TMF ties the two together via the score function of the noisy data. In this work, we extend this paradigm to learning and sampling the distribution of binary data on the Boolean hypercube by adopting Bernoulli noise, instead of Gaussian noise, as the smoothing device. We first derive a TMF-like expression for the optimal denoiser under the Hamming loss, in which a score function naturally appears. Sampling noisy binary data is then achieved with a Langevin-like sampler, which we analyze theoretically across noise levels. At high Bernoulli noise levels sampling becomes easy, akin to log-concave sampling in Euclidean spaces. In addition, we extend the sequential multi-measurement sampling of Saremi et al. (2024) to the binary setting, where we can bring the "effective noise" down by sampling multiple noisy measurements at a fixed noise level, without the need for continuous-time stochastic processes. We validate our formalism and theoretical findings by experiments on synthetic data and binarized images.
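For context, the Euclidean-space Tweedie-Miyasawa formula referenced above can be written as follows (this is the classical Gaussian-noise statement, not the binary analogue derived in the paper):

```latex
% Classical TMF: for a noisy observation y = x + n with n ~ N(0, sigma^2 I),
% the posterior mean (the MMSE denoiser) is expressed through the score of
% the noisy density p(y):
\[
  \mathbb{E}[x \mid y] \;=\; y \;+\; \sigma^{2}\,\nabla_{y} \log p(y).
\]
```

The paper's contribution is a counterpart of this identity for Bernoulli (bit-flip) noise on the Boolean hypercube, with the Hamming loss in place of the squared error.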
Lay Summary: Recent breakthroughs in data generation rely on two key ideas: adding noise to make complex data easier to model, and then learning how to reverse that noise to recover the original. These methods have worked well for images and other data in continuous spaces, where the noise is typically Gaussian.
In this work, we explore how to bring these powerful tools to binary data (data made up of only 0s and 1s) by using a different kind of noise called Bernoulli noise, which is akin to randomly flipping bits. The probability of these flips defines the key parameter of our model. We show how to recover the original data from its noisy version, using a mathematical formula similar to the one that works in the continuous case. This allows us to define a new way of sampling binary data, inspired by methods originally developed for continuous spaces.
We also adapt a technique for improving the quality of samples by combining multiple noisy versions of the same data. This helps us get better results without needing to simulate complex, continuous-time processes. We test our method on both synthetic data and black-and-white images, and our results support the theory.
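To make the bit-flip picture concrete, here is a minimal sketch of Bernoulli noising and Hamming-optimal denoising for independent bits with a known prior. This is an illustration only: the flip probability `p` and per-bit prior `q` are assumed known here, whereas the paper learns a score function for general distributions on the hypercube.

```python
import numpy as np

rng = np.random.default_rng(0)

p = 0.3  # flip probability (Bernoulli noise level), assumed known
q = 0.9  # prior P(x = 1) per bit, assumed known for this toy example

# sample clean bits and corrupt them with bit-flip (Bernoulli) noise
x = (rng.random(100_000) < q).astype(int)
y = np.where(rng.random(x.size) < p, 1 - x, x)

# exact posterior P(x = 1 | y) via Bayes' rule
post1 = np.where(
    y == 1,
    q * (1 - p) / (q * (1 - p) + (1 - q) * p),
    q * p / (q * p + (1 - q) * (1 - p)),
)

# Hamming-optimal denoiser: bitwise MAP, i.e. threshold the posterior at 1/2
x_hat = (post1 > 0.5).astype(int)

print("noisy accuracy:   ", (y == x).mean())
print("denoised accuracy:", (x_hat == x).mean())
```

With these values the posterior exceeds 1/2 for both observations, so the denoiser beats simply copying the noisy bits (accuracy near `q = 0.9` versus `1 - p = 0.7`), illustrating how denoising recovers signal lost to the flips.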
Primary Area: Probabilistic Methods->Monte Carlo and Sampling Methods
Keywords: Langevin MCMC, score function, Boolean hypercube, Bernoulli noise
Submission Number: 7951