PRISM: Patch Diffusion with Dynamic Retrieval Augmented Guidance and Permutation Invariant Conditioning

TMLR Paper6741 Authors

01 Dec 2025 (modified: 06 Dec 2025) | Under review for TMLR | CC BY 4.0
Abstract: Diffusion models have achieved state-of-the-art results in image generation but often require extensive computational resources and large-scale datasets, limiting their practicality in resource-constrained settings. To address these challenges, we introduce PRISM, a retrieval-guided, patch-based method that trains solely on image patches instead of full-resolution images. PRISM achieves better global coherence than patch-only baselines and outperforms them even when trained on only a fraction of the data. For each training example, PRISM retrieves semantically related neighbors from a disjoint retrieval set using CLIP embeddings. It aggregates their unordered signals with a Set Transformer, which makes the conditioning permutation-invariant and lets it capture higher-order relationships among the neighbors. A dynamic neighbor-annealing schedule adjusts the contextual guidance over the course of training, leading to more coherent results. Experiments on unconditional image generation with the CIFAR-10, CelebA, ImageNet-100, and AFHQv2 datasets, along with ablation studies, validate our approach, demonstrating that retrieval-augmented, set-based conditioning closes the coherence gap in patch-only diffusion.
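The conditioning pipeline the abstract describes (retrieve neighbors, pool them without regard to order, vary the neighbor count over training) can be illustrated with a small NumPy sketch. This is not the authors' implementation: random vectors stand in for CLIP embeddings, a single-seed softmax attention pool stands in for the Set Transformer, and the function names, the linear schedule, and its direction (`k_start` to `k_end`) are all assumptions made for illustration.

```python
import numpy as np


def retrieve_neighbors(query, bank, k):
    """Return indices of the k rows of `bank` most cosine-similar to `query`."""
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = b @ q
    return np.argsort(-sims)[:k]


def attention_pool(neighbors, seed):
    """Pool a set of vectors using one seed query and softmax attention.

    The result is a weighted sum over set elements, so reordering the
    rows of `neighbors` does not change it (permutation invariance).
    This is a simplified stand-in for a Set Transformer, not the real thing.
    """
    scores = neighbors @ seed / np.sqrt(neighbors.shape[1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ neighbors


def neighbors_at_step(step, total_steps, k_start, k_end):
    """Hypothetical linear schedule for the number of retrieved neighbors."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return int(round(k_start + frac * (k_end - k_start)))


rng = np.random.default_rng(0)
bank = rng.normal(size=(100, 16))                    # stand-in for retrieval-set embeddings
query = bank[7] + 0.01 * rng.normal(size=16)         # toy query close to entry 7
seed = rng.normal(size=16)                           # stand-in for a learned seed vector

idx = retrieve_neighbors(query, bank, k=5)
context = attention_pool(bank[idx], seed)
shuffled = attention_pool(bank[idx[::-1]], seed)     # same set, reversed order
```

In this toy setup `context` and `shuffled` agree up to floating-point error, which is the property the abstract refers to as permutation-invariant conditioning. In a real system the seed vector would be learned and the pooled vector would be passed to the patch denoiser as conditioning.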
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Lu_Jiang1
Submission Number: 6741