Abstract: Inserting an object into a background scene has wide applications in image editing and mixed reality. However, existing methods still struggle to seamlessly adapt the object to the background while preserving its individual characteristics. In this paper, we propose to fine-tune a pre-trained diffusion-based insertion model so that, given a few images of a target object as input, it learns to establish a unique correspondence between a small set of weights and that object. A novel individualized feature extraction (IFE) module is designed to extract individual detail features from the few-shot object images. The individual features of the target object, together with its semantic features and the background context features extracted by pre-trained image encoders, are then injected into the cross-attention modules of the latent diffusion model, enabling it to learn the correlations between the target object and the background scene through the attention mechanism. The fine-tuned weights implicitly serve as an alternative representation of the target object, with which the object can be easily inserted into any background image. Extensive comparative experiments validate the superiority of the proposed method over state-of-the-art insertion methods in preserving the individual details of the inserted object and adapting it to background scenes: it allows interaction between the inserted object and the background, correctly handles their occlusion relationship, and maintains consistency of viewpoint and pose.