Customizing Text-to-Image Generation with Inverted Interaction

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Subject-driven image generation, which aims to customize user-specified subjects, has made rapid progress. However, most existing methods focus on transferring the customized appearance of subjects. In this work, we consider a novel concept customization task: capturing the interaction between subjects in exemplar images and transferring the learned concept of interaction to achieve customized text-to-image generation. Intrinsically, the interaction between subjects is diverse and difficult to describe in only a few words. In addition, typical exemplar images depict interactions between humans, which further intensifies the challenge of interaction-driven image generation with various categories of subjects. To address this task, we adopt a divide-and-conquer strategy and propose a two-stage interaction inversion framework. The framework begins by learning a pseudo-word for the pose of each individual subject in the interaction; these pseudo-words are then employed to promote the learning of the concept for the interaction itself. In addition, a language prior and a cross-attention loss are incorporated into the optimization process to encourage the modeling of interaction. Extensive experiments demonstrate that the proposed method can effectively invert the interactive pose from exemplar images and apply it to customized generation with user-specified interactions.
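The core mechanism the abstract describes is textual-inversion-style learning: a pseudo-word's embedding is optimized by gradient descent so that a frozen encoder maps it onto a signal extracted from the exemplar images. The sketch below is a minimal, self-contained illustration of that idea only; the toy linear "encoder" `W`, the `target` vector, and all shapes are assumptions standing in for the paper's diffusion model, two-stage schedule, language prior, and cross-attention loss, none of which are reproduced here.

```python
# Hedged sketch of textual-inversion-style pseudo-word learning (NumPy only).
# In the actual framework the embedding would be optimized inside a frozen
# diffusion model's text encoder; here a fixed linear map and a random target
# stand in for that machinery, so every name below is illustrative.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # embedding dimension (assumed)
W = rng.normal(size=(d, d))             # frozen toy "text encoder" weights
target = rng.normal(size=d)             # stands in for the exemplar signal

def invert_embedding(target, W, steps=2000, lr=0.005):
    """Gradient-descend a single pseudo-word embedding e so that W @ e
    approaches the target (stage 1: one embedding per subject pose)."""
    e = np.zeros(d)
    for _ in range(steps):
        residual = W @ e - target       # reconstruction error
        grad = W.T @ residual           # gradient of 0.5 * ||W e - target||^2
        e -= lr * grad
    return e

e_pose = invert_embedding(target, W)
loss = 0.5 * np.sum((W @ e_pose - target) ** 2)
print(loss)                             # far below the initial 0.5*||target||^2
```

In the paper's second stage, pose embeddings learned this way would be held as context while a further pseudo-word for the interaction concept is optimized; the optimized embeddings can then be inserted into arbitrary text prompts.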
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: Subject-driven image generation, which aims to customize user-specified subjects, has made rapid progress. In this work, we consider a novel concept customization task: capturing the interactive pose between subjects in exemplar images and transferring the learned concept of interaction to achieve customized text-to-image generation. In short, our goal is to learn two pseudo-words, one for the pose of each subject in the interaction, plus a pseudo-word for the concept of the interactive pose itself. Once the embeddings of these three pseudo-words are optimized, they can be inserted into any description for text-to-image generation, a core multimedia processing task.
Supplementary Material: zip
Submission Number: 1266
