Abstract: In recent years, diffusion models have dominated the field of image generation with their outstanding generation quality. However, large-scale pre-trained diffusion models are generally trained on fixed-size images and fail to maintain their performance at other resolutions and aspect ratios. Existing diffusion-based methods for generating arbitrary-size images face several issues, including the need for extensive fine-tuning or training, slow sampling, and noticeable edge artifacts. This paper presents InstantAS, a method for arbitrary-size image generation. InstantAS performs non-overlapping minimum-coverage segmentation of the target image, minimizing the generation of redundant information and significantly improving sampling speed. To maintain the consistency of the generated image, we also propose an Inter-Domain Distribution Bridging method that integrates the distribution of the entire image and suppresses the divergence of diffusion paths across different regions. Furthermore, we propose a dynamic semantic-guided cross-attention method that allows different regions to be controlled with different semantics. InstantAS can be applied to nearly any existing pre-trained text-to-image diffusion model. Experimental results show that InstantAS fuses regions better than previous arbitrary-size image generation methods and substantially outperforms them in sampling speed.
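The non-overlapping minimum-coverage segmentation described above can be illustrated with a small sketch. The function names, the tile size of 64 latent units, and the even-split strategy below are illustrative assumptions, not the paper's actual implementation: the idea shown is simply that an arbitrary-size canvas is partitioned into the fewest tiles that each fit the model's native window, with no tile overlapping another, so no region is denoised twice.

```python
import math

def partition(size, tile):
    """Split `size` into the fewest contiguous, non-overlapping segments,
    each at most `tile` long. Remainder pixels are spread across segments
    so that segment lengths differ by at most 1 (an illustrative choice)."""
    n = math.ceil(size / tile)          # minimum number of segments
    base, extra = divmod(size, n)       # even split with remainder
    bounds, start = [], 0
    for i in range(n):
        length = base + (1 if i < extra else 0)
        bounds.append((start, start + length))
        start += length
    return bounds

def min_cover_tiles(height, width, tile=64):
    """Return (row0, row1, col0, col1) boxes that exactly tile the canvas:
    every pixel is covered once, and no box exceeds tile x tile."""
    rows = partition(height, tile)
    cols = partition(width, tile)
    return [(r0, r1, c0, c1) for r0, r1 in rows for c0, c1 in cols]
```

Because the tiles cover each pixel exactly once, the number of denoising passes grows linearly with the canvas area, whereas overlapping-window schemes re-generate the overlap bands on every pass.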
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Content] Vision and Language, [Content] Multimodal Fusion
Relevance To Conference: This paper proposes a novel accelerated sampling technique for arbitrary-size image generation with diffusion models. Specifically, it proposes a non-overlapping minimum-coverage segmentation method to reduce redundant information and an inter-domain distribution bridging method to ensure global consistency, which together improve the sampling efficiency of diffusion models while generating high-quality images. Additionally, the paper proposes a regional information control method that integrates textual information with region masks to achieve controlled generation, using different information for different regions of the image.
The findings of this paper are relevant to the following topics of ACM MM 2024:
Generative Multimedia: This paper proposes a novel accelerated sampling technique for arbitrary-size image generation with diffusion models, which improves the sampling efficiency of diffusion models while generating high-quality images. In addition, the method can be applied to nearly any pre-trained diffusion model, helping to expand the functionality of generative multimedia models.
Vision and Language: This paper proposes a regional information control method that provides finer control over the distribution of semantic information during image generation.
Supplementary Material: zip
Submission Number: 2925