Keywords: generative models, computer vision, machine learning
TL;DR: We introduce Palette, a simple unified framework for image-to-image translation using conditional diffusion models.
Abstract: We introduce Palette, a simple and general framework for image-to-image translation using conditional diffusion models. Palette models trained on four challenging image-to-image translation tasks (colorization, inpainting, uncropping, and JPEG restoration) outperform strong GAN and regression baselines and bridge the gap with natural images in terms of sample quality scores. This is accomplished without task-specific hyper-parameter tuning, architecture customization, or any auxiliary loss, demonstrating a desirable degree of generality and flexibility. We uncover the impact of an $L_2$ vs. $L_1$ loss in the denoising diffusion objective on sample diversity, and demonstrate the importance of self-attention through empirical architecture studies. Importantly, we advocate a unified evaluation protocol based on ImageNet, with human evaluation and sample quality scores (FID, Inception Score, Classification Accuracy of a pre-trained ResNet-50, and Perceptual Distance against original images). We expect this standardized evaluation protocol to play a critical role in advancing image-to-image translation research. Finally, we show that a generalist, multi-task Palette model performs as well or better than task-specific specialist counterparts. Check out https://bit.ly/palette-diffusion for more details.
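The abstract's point about the $L_2$ vs. $L_1$ choice in the denoising diffusion objective can be sketched as follows. This is a minimal, hedged illustration (not the paper's implementation): `dummy_denoiser` is a hypothetical stand-in for the trained U-Net $f_\theta$, and the noise-level parametrization via a single scalar `gamma` is a simplification of the actual training schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_loss(denoiser, x_src, y_tgt, gamma, p=2, rng=rng):
    """Conditional denoising diffusion objective (sketch).

    Noisy target:  y_noisy = sqrt(gamma) * y + sqrt(1 - gamma) * eps
    Loss:          mean( |denoiser(x_src, y_noisy, gamma) - eps|^p )

    p=2 gives the L2 variant; p=1 the L1 variant, which the abstract
    reports as affecting sample diversity.
    """
    eps = rng.standard_normal(y_tgt.shape)           # Gaussian noise to predict
    y_noisy = np.sqrt(gamma) * y_tgt + np.sqrt(1.0 - gamma) * eps
    pred = denoiser(x_src, y_noisy, gamma)           # conditioned on the source image
    return np.mean(np.abs(pred - eps) ** p)

# Hypothetical stand-in for the trained denoiser network f_theta.
def dummy_denoiser(x_src, y_noisy, gamma):
    return np.zeros_like(y_noisy)

x = rng.standard_normal((8, 8))   # source image (e.g. grayscale input)
y = rng.standard_normal((8, 8))   # target image (e.g. colorized output)

l2 = diffusion_loss(dummy_denoiser, x, y, gamma=0.5, p=2)
l1 = diffusion_loss(dummy_denoiser, x, y, gamma=0.5, p=1)
```

The only change between the two variants is the exponent on the per-pixel residual; the conditioning on the source image and the noise-prediction target are identical.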