Abstract: Transformers have recently achieved great success in pluralistic image inpainting. However, we find that existing transformer-based solutions regard each pixel as a token and thus suffer from information loss in two aspects: 1) They downsample the input image to a much lower resolution for efficiency, incurring information loss and extra misalignment at the boundaries of masked regions. 2) They quantize 256³ RGB pixels into a small number (such as 512) of quantized pixels. The indices of the quantized pixels are used as tokens for both the inputs and the prediction targets of the transformer. Although an extra CNN network is used to upsample and refine the low-resolution results, it is difficult to recover the lost information. To retain as much input information as possible, we propose a new transformer-based framework, “PUT”. Specifically, to avoid input downsampling while maintaining computational efficiency, we design a patch-based auto-encoder, P-VQVAE, whose encoder converts the masked image into non-overlapped patch tokens and whose decoder recovers the masked regions from the inpainted tokens while keeping the unmasked regions unchanged. To eliminate the information loss caused by quantization, an Un-Quantized Transformer (UQ-Transformer) is applied, which directly takes the features from the P-VQVAE encoder as input without quantization and regards the quantized tokens only as prediction targets. Extensive experiments show that PUT greatly outperforms state-of-the-art methods in image fidelity, especially for large masked regions and complex large-scale datasets.
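To make the described pipeline concrete, below is a minimal sketch of the architecture the abstract outlines: a patch-based encoder producing non-overlapped patch tokens, and a transformer that consumes the un-quantized features directly while predicting codebook indices as targets. All module names and hyper-parameters here (`PatchEncoder`, `UQTransformer`, `patch_size=8`, `dim=256`, `num_codes=512`) are illustrative assumptions, not the authors' released implementation; the P-VQVAE decoder is omitted.

```python
# Hypothetical sketch of the PUT pipeline from the abstract (not official code).
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    """P-VQVAE-style encoder: maps non-overlapped image patches to feature tokens."""
    def __init__(self, patch_size=8, dim=256):
        super().__init__()
        # A strided conv is one simple way to extract non-overlapping patch features.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, masked_image):
        feats = self.proj(masked_image)            # (B, dim, H/ps, W/ps)
        return feats.flatten(2).transpose(1, 2)    # (B, N, dim) patch tokens

class UQTransformer(nn.Module):
    """Takes un-quantized encoder features as input; quantized tokens are targets only."""
    def __init__(self, dim=256, num_codes=512, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.to_logits = nn.Linear(dim, num_codes)  # logits over the codebook indices

    def forward(self, tokens):
        return self.to_logits(self.blocks(tokens))  # (B, N, num_codes)

# Usage: encode a masked image and predict code indices for its patches; a decoder
# (omitted here) would then rebuild the masked regions from the inpainted codes
# while passing the unmasked regions through unchanged.
enc, tf = PatchEncoder(), UQTransformer()
masked = torch.randn(1, 3, 256, 256)               # a masked input image
logits = tf(enc(masked))                           # (1, 1024, 512)
pred_codes = logits.argmax(-1)                     # predicted codebook indices
```

Note the design choice this sketch mirrors: the transformer never sees quantized inputs, so quantization error cannot propagate into its predictions; the codebook is used only to define the discrete prediction targets.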