When Rule Learning Breaks: Diffusion Fails to Learn Parity of Many Bits

Published: 23 Sept 2025, Last Modified: 23 Dec 2025 · SPIGM @ NeurIPS Oral · CC BY 4.0
Keywords: diffusion, parity, rule learning, creativity, generalization, learning dynamics, diffusion transformer
TL;DR: Diffusion transformers can learn simple parity rules hidden in data, but fail on many-bit parity, with depth extending—but not eliminating—the limit.
Abstract: Diffusion models can generate highly realistic samples, but do they learn the latent rules that govern a distribution, and if so, what kinds of rules can they learn? We address this question using a controlled \emph{group-parity} benchmark on $6{\times}6$ binary images, where each group of $G$ bits must satisfy an even-parity constraint. This setup allows us to tune rule complexity precisely via $G$ and to measure both correctness and memorization at the group and sample levels. Using EDM-parameterized Diffusion Transformers of varying depth, we find that: (i) learnability depends jointly on $G$ and depth, with deeper models extending—but not eliminating—the range of learnable rules; (ii) successful rule learning exhibits a sharp early transition in accuracy that precedes memorization, creating a temporal window for generalization; (iii) memorization onset follows a steps-per-sample scaling law and is delayed by larger datasets. Further, we analyze the learned energy/score to relate learning difficulty to the group size $G$ and the model depth. Together, these results offer a principled testbed and new insights into the interplay between rule complexity, rule learning, and memorization in diffusion models.
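
To make the benchmark construction concrete, here is a minimal sketch of how group-parity samples on a $6{\times}6$ binary grid could be generated and scored. The consecutive raster-order grouping and the helper names (`sample_group_parity_images`, `group_accuracy`) are illustrative assumptions, not the authors' released code; the abstract specifies only that each group of $G$ bits satisfies an even-parity constraint.

```python
import numpy as np

def sample_group_parity_images(n_samples, G, side=6, seed=0):
    """Sample binary images whose pixels, partitioned into consecutive
    raster-order groups of size G, each satisfy an even-parity constraint.
    (The grouping scheme is an assumption for illustration.)"""
    rng = np.random.default_rng(seed)
    n_bits = side * side
    assert n_bits % G == 0, "group size must divide the number of pixels"

    # Sample the first G-1 bits of each group freely, then set the last bit
    # so that the group sum is even (XOR of the group equals zero).
    flat = rng.integers(0, 2, size=(n_samples, n_bits))
    groups = flat.reshape(n_samples, n_bits // G, G)
    groups[..., -1] = groups[..., :-1].sum(axis=-1) % 2
    return groups.reshape(n_samples, side, side)

def group_accuracy(images, G):
    """Fraction of groups in a batch whose even-parity constraint holds."""
    flat = images.reshape(images.shape[0], -1)
    groups = flat.reshape(flat.shape[0], -1, G)
    return float((groups.sum(axis=-1) % 2 == 0).mean())

if __name__ == "__main__":
    x = sample_group_parity_images(1000, G=4)
    print(group_accuracy(x, G=4))  # 1.0 by construction on the training data
```

In this sketch, increasing $G$ leaves the marginal pixel statistics unchanged while making the hidden constraint higher-order, which is what lets the benchmark isolate rule complexity from surface realism.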
Submission Number: 123