EdiBERT: a generative model for image editing
Abstract: Advances in computer vision are pushing the limits of image manipulation, with generative models sampling highly realistic, detailed images on various tasks. However, a specialized model is often developed and trained for each specific task, even though many image editing tasks share similarities. In denoising, inpainting, or image compositing, the goal is always to generate a realistic image from a low-quality one. In this paper, we take a step towards a unified approach for image editing. To do so, we propose EdiBERT, a bidirectional transformer that re-samples image patches conditioned on a given image. Using one generic objective, we show that the model resulting from a single training matches state-of-the-art GAN inversion methods on several tasks: image denoising, image completion, and image composition. We also provide several insights on the latent space of vector-quantized auto-encoders, such as its locality and reconstruction capacity. The code is available at https://github.com/EdiBERT4ImageManipulation/EdiBERT.
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission:
1st revision. Following the reviewers' requests, we made the following changes:
- We clarified the introduction and briefly explained the difference between the denoising and inpainting tasks.
- After recalling the formula of the attention mechanism, we detailed the training objective of VQGAN in the related work and explained how our work builds on top of it.
- We added Figure 2 to explain and better visualize the training of EdiBERT and, more specifically, our 2D selection strategy.
- We changed Figure 4 to compare EdiBERT with GAN inversion methods on the task of reconstructing target images, and added a quantitative comparison based on LPIPS.
- We added Figure 7 for a better comparison of EdiBERT on the task of inpainting.
- Finally, we clarified some formulas and extended many figure captions.
2nd revision:
- We updated the related work with more diffusion-related papers.
- We added a new figure in the Appendix displaying qualitative results on the task of image compositing.
Assigned Action Editor: ~Jia-Bin_Huang1
Submission Number: 299