Keywords: Diffusion models, alignment, image synthesis
TL;DR: ArtAug enhances text-to-image models through interactions with image understanding models, fusing the resulting improvements into the generation model itself at no additional inference cost.
Abstract: The emergence of diffusion models has significantly advanced image synthesis. Recent studies of model interaction and self-corrective reasoning in large language models offer new insights for enhancing text-to-image models. Inspired by these studies, we propose a novel method called ArtAug, which enhances text-to-image models through interactions with image understanding models. In these interactions, we leverage human preferences implicitly learned by image understanding models to provide fine-grained suggestions for image generation models. The interactions can modify the image content to make it more aesthetically pleasing, for example by adjusting exposure, changing shooting angles, and adding atmospheric effects. The enhancements brought by the interactions are iteratively fused into the generation model itself through an additional enhancement module. This enables the generation model to produce aesthetically pleasing images directly, with no additional inference cost. In the experiments, we verify the effectiveness of ArtAug on advanced models such as FLUX, Stable Diffusion 3.5, and Qwen2-VL, with extensive evaluations covering image quality metrics, human evaluation, and ethics. The source code and models will be released publicly.
Primary Area: generative models
Submission Number: 10112