When Rule Learning Breaks: Diffusion Fails to Learn Parity of Many Bits

Published: 23 Sept 2025, Last Modified: 23 Dec 2025 · SPIGM @ NeurIPS Oral · CC BY 4.0
Keywords: diffusion, parity, rule learning, creativity, generalization, learning dynamics, diffusion transformer
TL;DR: Diffusion transformers can learn simple parity rules hidden in data, but fail on many-bit parity, with depth extending—but not eliminating—the limit.
Abstract: Diffusion models can generate highly realistic samples, but do they learn the latent rules that govern a distribution, and if so, what kinds of rules can they learn? We address this question using a controlled \emph{group-parity} benchmark on $6{\times}6$ binary images, where each group of $G$ bits must satisfy an even-parity constraint. This setup allows us to tune rule complexity precisely via $G$ and to measure both correctness and memorization at the group and sample levels. Using EDM-parameterized Diffusion Transformers of varying depth, we find that: (i) learnability depends jointly on $G$ and depth, with deeper models extending—but not eliminating—the range of learnable rules; (ii) successful rule learning exhibits a sharp early transition in accuracy that precedes memorization, creating a temporal window for generalization; (iii) memorization onset follows a steps-per-sample scaling law and is delayed by larger datasets. Further, we analyze the learned energy/score to relate learning difficulty to the group size $G$ and the model depth. Together, these results offer a principled testbed and new insights into the interplay between rule complexity, rule learning, and memorization in diffusion models.
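
To make the benchmark construction concrete, here is a minimal sketch of how group-parity samples on a $6{\times}6$ binary grid could be generated and scored. The consecutive raster-order grouping and the helper names (`sample_group_parity_images`, `group_accuracy`) are illustrative assumptions, not the authors' released code; the abstract specifies only that each group of $G$ bits satisfies an even-parity constraint.

```python
import numpy as np

def sample_group_parity_images(n_samples, G, side=6, seed=0):
    """Sample binary images whose pixels, partitioned into consecutive
    raster-order groups of size G, each satisfy an even-parity constraint.
    (The grouping scheme is an assumption for illustration.)"""
    rng = np.random.default_rng(seed)
    n_bits = side * side
    assert n_bits % G == 0, "group size must divide the number of pixels"

    # Sample the first G-1 bits of each group freely, then set the last bit
    # so that the group sum is even (XOR of the group equals zero).
    flat = rng.integers(0, 2, size=(n_samples, n_bits))
    groups = flat.reshape(n_samples, n_bits // G, G)
    groups[..., -1] = groups[..., :-1].sum(axis=-1) % 2
    return groups.reshape(n_samples, side, side)

def group_accuracy(images, G):
    """Fraction of groups in a batch whose even-parity constraint holds."""
    flat = images.reshape(images.shape[0], -1)
    groups = flat.reshape(flat.shape[0], -1, G)
    return float((groups.sum(axis=-1) % 2 == 0).mean())

if __name__ == "__main__":
    x = sample_group_parity_images(1000, G=4)
    print(group_accuracy(x, G=4))  # 1.0 by construction on the training data
```

In this sketch, increasing $G$ leaves the marginal pixel statistics unchanged while making the hidden constraint higher-order, which is what lets the benchmark isolate rule complexity from surface realism.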
Submission Number: 123