POSITION EMBEDDING INTERPOLATION IS ALL YOU NEED FOR EFFICIENT IMAGE-TO-IMAGE VIT

22 Sept 2023 (modified: 11 Feb 2024). Submitted to ICLR 2024.
Supplementary Material: zip
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: super-resolution, image inpainting, vision Transformer, position embedding interpolation, diffusion model
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: General image inpainting methods have recently made great progress in restoring large free-form missing regions, but it remains challenging to inpaint a high-resolution image directly so that the result is photo-realistic while training and inference costs stay low. To address this, we propose a computation-efficient framework that combines a diffusion model with a ViT-based super-resolution (ViTSR) module. We train the guided diffusion model to inpaint the image at low resolution, which reduces training and inference costs, and use ViTSR to reconstruct the image at its original high resolution. The idea is simple, but it depends on an excellent reconstruction module: one that brings the low-resolution output to high resolution with texture that is hardly distinguishable from the original image. ViTSR employs the vanilla ViT architecture and uses position embedding interpolation (PEI) so that the module can be trained at low resolution and applied at any resolution during inference. ViTSR leverages latent image-to-image translation to capture global attention information and reconstructs the image with state-of-the-art performance. In experiments on CelebA, Places2, and other datasets, the framework obtains superior performance on high-resolution image inpainting and super-resolution tasks. We further propose a general ViT-based auto-encoder for image-to-image translation tasks that can be accelerated by position embedding interpolation.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4395
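
The abstract describes position embedding interpolation (PEI) only at a high level: position embeddings learned on a small patch grid are resized so the same ViT can run on a larger grid at inference. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes a 2D grid of learned embeddings and plain bilinear resizing with corner-aligned sampling; the interpolation mode actually used in the paper is not stated in the abstract, and the function name `interpolate_pos_embed` is hypothetical.

```python
from typing import List

# A grid of position embeddings indexed as [row][col][channel].
Grid = List[List[List[float]]]


def interpolate_pos_embed(grid: Grid, new_h: int, new_w: int) -> Grid:
    """Bilinearly resize an (H, W, D) position-embedding grid to (new_h, new_w, D).

    Assumption: corner-aligned sampling, so the first and last target
    positions coincide with the first and last source positions.
    """
    old_h, old_w = len(grid), len(grid[0])
    dim = len(grid[0][0])

    def src_coord(i: int, new_n: int, old_n: int) -> float:
        # Map target index i to a fractional source coordinate.
        if new_n == 1 or old_n == 1:
            return 0.0
        return i * (old_n - 1) / (new_n - 1)

    out: Grid = []
    for r in range(new_h):
        y = src_coord(r, new_h, old_h)
        y0 = int(y)
        y1 = min(y0 + 1, old_h - 1)
        wy = y - y0  # vertical blend weight toward row y1
        row = []
        for c in range(new_w):
            x = src_coord(c, new_w, old_w)
            x0 = int(x)
            x1 = min(x0 + 1, old_w - 1)
            wx = x - x0  # horizontal blend weight toward column x1
            vec = [
                (1 - wy) * ((1 - wx) * grid[y0][x0][d] + wx * grid[y0][x1][d])
                + wy * ((1 - wx) * grid[y1][x0][d] + wx * grid[y1][x1][d])
                for d in range(dim)
            ]
            row.append(vec)
        out.append(row)
    return out


# Toy example: a 2x2 grid of 1-dimensional embeddings resized to 3x3,
# standing in for "train on a small grid, infer on a larger one".
small = [[[0.0], [1.0]], [[2.0], [3.0]]]
large = interpolate_pos_embed(small, 3, 3)
```

In this toy case the four original embeddings stay at the corners of the larger grid and the new positions are blends of their neighbours, so the learned layout is stretched rather than retrained. Practical ViT code usually does the same resizing with a tensor library's image-resize routine, often in bicubic mode.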