Abstract: Customized Image Generation, i.e., generating customized images with user-specified concepts, has attracted significant attention due to its creativity and novelty. With impressive progress achieved in subject customization, some pioneering works have further explored the customization of actions and interactions beyond entity (i.e., human, animal, and object) appearance. However, these approaches only focus on basic actions and interactions between two entities, and their effectiveness is limited by the scarcity of "exactly same" reference images. To extend customized image generation to more complex scenes for general real-world applications, we propose a new task: event-customized image generation. Given a single reference image, we define the "event" as all specific actions, poses, relations, and interactions between different entities in the scene. This task aims to accurately capture the complex event and generate customized images with various target entities. To solve this task, we propose a novel training-free event customization method: FreeEvent. Specifically, FreeEvent introduces two extra paths alongside the general diffusion denoising process: 1) an entity switching path, which applies cross-attention guidance and regulation for target entity generation; and 2) an event transferring path, which injects spatial features and self-attention maps from the reference image into the target image for event generation. To further facilitate this new task, we collect two evaluation benchmarks: SWiG-Event and Real-Event. Extensive experiments and ablations demonstrate the effectiveness of FreeEvent.
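As a rough illustration of the attention-injection idea behind the event transferring path, below is a minimal, self-contained PyTorch sketch, not the authors' implementation: a toy self-attention layer records its attention map during a pass over reference-image tokens and reuses that map when denoising the target, so the target inherits the reference layout. All names (`InjectableSelfAttention`, `ref_latent`, `tgt_latent`) and dimensions are illustrative assumptions.

```python
import torch

class InjectableSelfAttention(torch.nn.Module):
    """Toy self-attention layer whose attention map can be recorded on a
    reference denoising pass and re-injected on the target pass (hypothetical)."""
    def __init__(self, dim):
        super().__init__()
        self.to_q = torch.nn.Linear(dim, dim)
        self.to_k = torch.nn.Linear(dim, dim)
        self.to_v = torch.nn.Linear(dim, dim)
        self.stored_attn = None   # attention map saved from the reference path
        self.inject = False       # when True, reuse stored_attn instead of recomputing

    def forward(self, x):
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        attn = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
        if self.inject and self.stored_attn is not None:
            attn = self.stored_attn           # event transfer: reuse reference structure
        else:
            self.stored_attn = attn.detach()  # record map for later injection
        return attn @ v

# Usage sketch with made-up latent token tensors.
layer = InjectableSelfAttention(dim=64)
ref_latent = torch.randn(1, 16, 64)   # hypothetical reference-image tokens
tgt_latent = torch.randn(1, 16, 64)   # hypothetical target-image tokens

_ = layer(ref_latent)                 # reference path: record the self-attention map
layer.inject = True
out = layer(tgt_latent)               # target path: denoise with the injected map
```

In an actual diffusion pipeline this recording/injection would be applied to the U-Net's self-attention layers at each denoising step, alongside a separate cross-attention mechanism for placing the target entities; the snippet above only sketches the reuse of a stored attention map.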
Lay Summary: Imagine being able to change the people or objects in a photo, but still keep everything else — like their poses, actions, and how they interact — exactly the same. For example, replacing a person in a dance photo with someone else, while keeping the same dance move and background. This idea is called event-customized image generation, and it goes beyond just changing how someone looks — it focuses on keeping the “event” in the picture the same.
In our research, we introduce a new way to do this. Instead of needing many matching photos for reference, our method only needs one image to learn the key details of an event, such as who is doing what, how they are posed, and how they relate to others in the scene. Then, it can generate new images where the same event happens, but with different people or objects.
We call our method FreeEvent, and it doesn’t require extra training, which makes it more flexible and easier to use. It works in two main steps: one focuses on replacing people or objects while keeping their role in the scene, and the other copies the event details — like actions and positions — from the original photo to the new one.
We also created two new test sets to measure how well our method works. The results show that FreeEvent can generate realistic and detailed images that faithfully keep the original event, even when using completely new characters.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Applications->Computer Vision
Keywords: Customized Image Generation, Diffusion Model
Submission Number: 10268