Abstract: Diffusion models now generate high-quality, diverse samples, and research increasingly focuses on building more powerful models. Although ensembling is a well-known way to improve supervised models, its application to unconditional score-based diffusion models remains largely unexplored. In this work, we investigate whether it provides tangible benefits for generative modelling. We find that while ensembling the scores generally improves the score-matching loss and model likelihood, it fails to consistently improve perceptual quality metrics such as FID on image datasets. We confirm this observation across a breadth of aggregation rules, using Deep Ensembles and Monte Carlo Dropout, on CIFAR-10 and FFHQ. We investigate possible explanations for this discrepancy, such as the link between score estimation and image quality. We also study tabular data with random forests and find that one aggregation strategy outperforms the others. Finally, we provide theoretical
insights into the summing of score models, which shed light not only on ensembling but also
on several model composition techniques (e.g. guidance). Our Python code is available at
https://anonymous.4open.science/r/score_diffusion_ensemble-B758.
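For concreteness, ensembling scores amounts to replacing the single score estimate used at each reverse-diffusion step with an aggregate over networks. The following is a minimal sketch, not taken from the submission: `ToyScoreNet`, `ensemble_score`, and the choice of plain averaging are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ToyScoreNet(nn.Module):
    """Stand-in for a score network s_theta(x, t); purely illustrative."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Linear(dim + 1, dim)

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Concatenate the (broadcast) diffusion time as an extra feature.
        t_feat = t.expand(x.shape[0], 1)
        return self.net(torch.cat([x, t_feat], dim=1))

def ensemble_score(models, x, t, rule: str = "mean") -> torch.Tensor:
    """Aggregate score estimates from an ensemble of score networks.

    Plain averaging is shown; the paper compares several aggregation
    rules, and this function is where alternatives would plug in.
    """
    scores = torch.stack([m(x, t) for m in models])  # (n_models, batch, dim)
    if rule == "mean":
        return scores.mean(dim=0)
    raise ValueError(f"unknown aggregation rule: {rule!r}")

# Usage: the ensembled score replaces a single model's estimate inside
# any sampler (e.g. an Euler-Maruyama step of the reverse SDE).
models = [ToyScoreNet(dim=2) for _ in range(5)]
x = torch.randn(8, 2)                  # batch of noisy samples
t = torch.tensor([[0.5]])              # diffusion time
score = ensemble_score(models, x, t)   # shape (8, 2)
```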
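As context for the final point on summing score models (a standard identity stated here for the reader, not the submission's own result): summing score functions yields the score of a product of densities, which is the mechanism shared by ensembling and composition rules such as guidance.

```latex
% Summing scores = score of an (unnormalized) product of densities:
\nabla_x \log\!\big(p_1(x)\,p_2(x)\big)
  = \nabla_x \log p_1(x) + \nabla_x \log p_2(x).
% Averaging N ensemble scores therefore targets the geometric mean
% \big(\prod_{i=1}^N p_i(x)\big)^{1/N} up to normalization, not the
% mixture \tfrac{1}{N}\sum_i p_i(x); classifier guidance is the weighted
% analogue \nabla_x \log p(x) + w\, \nabla_x \log p(c \mid x).
```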
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Sylvain_Le_Corff1
Submission Number: 5530