Customizing Text-to-Image Generation with Inverted Interaction

Published: 01 Jan 2024 · Last Modified: 14 Feb 2025 · ACM Multimedia 2024 · CC BY-SA 4.0
Abstract: Subject-driven image generation, which customizes images to user-specified subjects, has progressed rapidly. However, most existing methods focus on transferring the customized appearance of subjects. In this work, we consider a novel concept customization task: capturing the interaction between subjects in exemplar images and transferring the learned interaction concept to customized text-to-image generation. Intrinsically, interactions between subjects are diverse and difficult to describe in only a few words. Moreover, typical exemplar images depict interactions between humans, which further complicates interaction-driven generation across diverse subject categories. To address this task, we adopt a divide-and-conquer strategy and propose a two-stage interaction inversion framework. The framework first learns a pseudo-word for the single pose of each subject in the interaction; these pose pseudo-words are then used to promote learning of the interaction concept itself. In addition, a language prior and a cross-attention loss are incorporated into the optimization to encourage the modeling of interaction. Extensive experiments demonstrate that the proposed method effectively inverts the interactive pose from exemplar images and applies it to customized generation with user-specified interactions.
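
To make the two-stage recipe concrete, below is a minimal, self-contained sketch of how such an inversion loop might be structured. Everything here is a hypothetical stand-in rather than the authors' implementation: the function names (`learn_pose_token`, `learn_interaction_token`), the toy `denoising_loss`, and the attention proxy are all illustrative. A real system would optimize the pseudo-token embeddings through a frozen text-to-image diffusion model and read the cross-attention maps from its U-Net layers.

```python
# Hypothetical sketch of two-stage interaction inversion (not the authors' code).
# Stage 1 learns a pseudo-token embedding for each subject's pose; stage 2
# freezes those tokens and optimizes an interaction token, adding a
# cross-attention loss that penalizes attention mass outside the subjects.
import torch

EMB_DIM = 768  # text-embedding width (e.g., CLIP ViT-L/14)


def denoising_loss(embeddings, images):
    # Toy stand-in for the diffusion objective
    # || eps - eps_theta(x_t, t, c(embeddings)) ||^2 on the exemplar images.
    # A real implementation would noise `images` and run a frozen U-Net.
    return (embeddings.mean() - images.mean()).pow(2)


def cross_attention_loss(attn_maps, subject_masks):
    # Penalize attention mass falling outside the union of subject masks,
    # encouraging the interaction token to attend to both subjects.
    outside = attn_maps * (1.0 - subject_masks)
    return outside.sum() / attn_maps.sum().clamp(min=1e-8)


# --- Stage 1: invert one pseudo-word per subject pose ----------------------
def learn_pose_token(images, steps=500, lr=5e-3):
    token = torch.randn(EMB_DIM, requires_grad=True)
    opt = torch.optim.AdamW([token], lr=lr)
    for _ in range(steps):
        loss = denoising_loss(token, images)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return token.detach()


# --- Stage 2: invert the interaction, reusing the frozen pose tokens -------
def learn_interaction_token(images, pose_tokens, subject_masks,
                            steps=800, lr=5e-3, lam=0.1):
    inter = torch.randn(EMB_DIM, requires_grad=True)
    opt = torch.optim.AdamW([inter], lr=lr)
    for _ in range(steps):
        # The text condition would come from a template such as
        # "a photo of <pose-A> and <pose-B> <interaction>"; here we simply
        # stack the frozen pose tokens with the trainable interaction token.
        cond = torch.stack([*pose_tokens, inter])
        attn = torch.sigmoid(cond @ cond.T)  # toy proxy for U-Net attention
        loss = (denoising_loss(cond, images)
                + lam * cross_attention_loss(attn, subject_masks))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return inter.detach()


# Toy usage with random data, just to show the call order of the two stages.
imgs = torch.randn(4, 3, 64, 64)   # exemplar images (placeholder)
pose_a = learn_pose_token(imgs)
pose_b = learn_pose_token(imgs)
masks = torch.ones(3, 3)           # placeholder masks matching the attn proxy
inter = learn_interaction_token(imgs, [pose_a, pose_b], masks)
```

The key design point this sketch tries to mirror is the divide-and-conquer order: the pose tokens are learned first and then frozen, so that during stage 2 the gradient signal is attributed to the interaction token alone, while the cross-attention term steers it toward jointly covering both subjects.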