Customizing Text-to-Image Generation with Inverted Interaction

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Subject-driven image generation, which aims to customize user-specified subjects, has made rapid progress. However, most existing methods focus on transferring the customized appearance of subjects. In this work, we consider a novel concept customization task: capturing the interaction between subjects in exemplar images and transferring the learned concept of interaction to achieve customized text-to-image generation. Intrinsically, the interaction between subjects is diverse and difficult to describe in only a few words. In addition, typical exemplar images depict interactions between humans, which further intensifies the challenge of interaction-driven image generation with various categories of subjects. To address this task, we adopt a divide-and-conquer strategy and propose a two-stage interaction inversion framework. The framework begins by learning a pseudo-word for the pose of each individual subject in the interaction; these pseudo-words are then employed to promote the learning of the concept for the interaction itself. In addition, a language prior and a cross-attention loss are incorporated into the optimization process to encourage the modeling of interaction. Extensive experiments demonstrate that the proposed method can effectively invert the interactive pose from exemplar images and apply it to customized generation with user-specified interactions.
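The core mechanism the abstract describes is textual-inversion-style learning: a pseudo-word's embedding is optimized by gradient descent so that a frozen encoder maps it onto a signal extracted from the exemplar images. The sketch below is a minimal, self-contained illustration of that idea only; the toy linear "encoder" `W`, the `target` vector, and all shapes are assumptions standing in for the paper's diffusion model, two-stage schedule, language prior, and cross-attention loss, none of which are reproduced here.

```python
# Hedged sketch of textual-inversion-style pseudo-word learning (NumPy only).
# In the actual framework the embedding would be optimized inside a frozen
# diffusion model's text encoder; here a fixed linear map and a random target
# stand in for that machinery, so every name below is illustrative.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # embedding dimension (assumed)
W = rng.normal(size=(d, d))             # frozen toy "text encoder" weights
target = rng.normal(size=d)             # stands in for the exemplar signal

def invert_embedding(target, W, steps=2000, lr=0.005):
    """Gradient-descend a single pseudo-word embedding e so that W @ e
    approaches the target (stage 1: one embedding per subject pose)."""
    e = np.zeros(d)
    for _ in range(steps):
        residual = W @ e - target       # reconstruction error
        grad = W.T @ residual           # gradient of 0.5 * ||W e - target||^2
        e -= lr * grad
    return e

e_pose = invert_embedding(target, W)
loss = 0.5 * np.sum((W @ e_pose - target) ** 2)
print(loss)                             # far below the initial 0.5*||target||^2
```

In the paper's second stage, pose embeddings learned this way would be held as context while a further pseudo-word for the interaction concept is optimized; the optimized embeddings can then be inserted into arbitrary text prompts.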
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: Subject-driven image generation, which aims to customize user-specified subjects, has made rapid progress. In this work, we consider a novel concept customization task: capturing the interactive pose between subjects in exemplar images and transferring the learned concept of interaction to achieve customized text-to-image generation. In short, our goal is to learn two pseudo-words, one for the pose of each subject in the interaction, plus a pseudo-word for the concept of the interactive pose itself. Once the embeddings of these three pseudo-words are optimized, they can be inserted into any description for text-to-image generation, a core multimedia processing task.
Supplementary Material: zip
Submission Number: 1266
