Abstract: Personalization of diffusion models has seen significant advances. Conventional tuning-free methods mostly encode multiple reference images by averaging or concatenating their image embeddings as the injection condition, but such image-independent operations allow no interaction among the references and therefore cannot capture the visual elements they consistently share. Although tuning-based approaches can effectively extract consistent elements from multiple images through the training process, they require test-time finetuning for each distinct image group. This paper introduces EasyRef, a plug-and-play adaptation method that enables diffusion models to condition on consistent visual elements (e.g., style or human facial identity) shared across multiple reference images under instruction control. To effectively exploit these consistent visual elements, we leverage the multi-image comprehension and instruction-following capabilities of a multimodal large language model (MLLM), prompting it to capture the consistent elements according to the instruction. Moreover, injecting the MLLM's representations into the diffusion process through adapters enables easy generalization to unseen domains. To reduce computational costs and enhance fine-grained detail preservation, we introduce an efficient reference aggregation strategy and a progressive training scheme. Finally, we introduce MRBench, a new multi-reference image generation benchmark. Experimental results demonstrate that EasyRef surpasses both tuning-free and tuning-based methods, achieving superior aesthetic quality and robust zero-shot generalization across diverse domains.
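The abstract describes injecting MLLM representations of the reference images into the diffusion process through adapters. Below is a minimal, hypothetical sketch (not the authors' released code) of one way such an injection could look: MLLM hidden states summarizing several reference images and an instruction are aggregated into a fixed set of conditioning tokens, which a parallel adapter cross-attention branch then adds alongside the frozen text cross-attention of the UNet. All module names, shapes, and hyperparameters here are illustrative assumptions.

```python
# Illustrative sketch only: adapter-style injection of MLLM reference tokens
# into a diffusion UNet block. Not the official EasyRef implementation.
import torch
import torch.nn as nn

class RefProjector(nn.Module):
    """Maps MLLM hidden states (covering N reference images + an instruction)
    to a fixed number of conditioning tokens via learnable query attention."""
    def __init__(self, mllm_dim=3584, cond_dim=2048, num_tokens=64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, mllm_dim))
        self.attn = nn.MultiheadAttention(mllm_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(mllm_dim, cond_dim)

    def forward(self, mllm_hidden):                      # (B, L, mllm_dim)
        q = self.queries.unsqueeze(0).expand(mllm_hidden.size(0), -1, -1)
        agg, _ = self.attn(q, mllm_hidden, mllm_hidden)  # aggregate references
        return self.proj(agg)                            # (B, num_tokens, cond_dim)

class AdapterCrossAttention(nn.Module):
    """Decoupled cross-attention: the text branch is left as-is, while a
    parallel branch attends to the reference tokens and is added on top."""
    def __init__(self, dim=2048, cond_dim=2048, heads=8):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ref_attn = nn.MultiheadAttention(dim, heads, kdim=cond_dim,
                                              vdim=cond_dim, batch_first=True)
        self.scale = 1.0  # strength of the reference condition (assumed knob)

    def forward(self, x, text_tokens, ref_tokens):
        out_text, _ = self.text_attn(x, text_tokens, text_tokens)
        out_ref, _ = self.ref_attn(x, ref_tokens, ref_tokens)
        return out_text + self.scale * out_ref

# Toy usage with random tensors standing in for real MLLM / UNet activations.
B = 1
mllm_hidden = torch.randn(B, 4 * 256, 3584)   # e.g., tokens of 4 reference images
ref_tokens = RefProjector()(mllm_hidden)      # (B, 64, 2048)

latents = torch.randn(B, 1024, 2048)          # flattened UNet spatial tokens
text_tokens = torch.randn(B, 77, 2048)        # text-encoder tokens
out = AdapterCrossAttention()(latents, text_tokens, ref_tokens)
print(out.shape)                              # torch.Size([1, 1024, 2048])
```

In this sketch, only the projector and the reference-attention branch would be trained, which is what lets such an adapter remain plug-and-play with a frozen base diffusion model.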
Lay Summary: When creating images with diffusion models, users often want to personalize the output based on multiple reference images, like combining features from several faces to make a new portrait. Existing methods either average the information from these images without considering their consistent elements or require costly retraining every time a user wants to personalize the model for a new set of references. To solve this, we developed EasyRef, an easy-to-use technique that lets diffusion models consistently incorporate visual elements, such as artistic style or facial identity, from multiple reference images, guided by simple instructions. We achieve this by leveraging a powerful vision-language model that understands multiple images together and identifies consistent features based on the given prompt. By smoothly integrating this model's understanding into the diffusion process, EasyRef adapts effortlessly even to types of images it has never encountered before. Additionally, our method aggregates reference images efficiently, saving computation while preserving fine details. We also introduce MRBench, a new benchmark for evaluating multi-reference image generation. Experiments show that EasyRef outperforms existing methods, producing more visually appealing and consistent images while working effectively across various visual styles without extra customization.
Link To Code: https://github.com/TempleX98/EasyRef
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: personalized image generation, diffusion model, multimodal large language models
Submission Number: 10520