Product2Img: Prompt-Free E-commerce Product Background Generation with Diffusion Model and Self-Improved LMM
Abstract: In e-commerce platforms, visual content plays a pivotal role in capturing and retaining audience attention. A high-quality, aesthetically designed product background image can quickly grab consumers' attention and increase their confidence in taking actions such as making a purchase. Recently, diffusion models have achieved remarkable advances, making product background generation a promising avenue for exploration. However, text-guided diffusion models require meticulously crafted prompts, and the diverse range of products makes it challenging to compose prompts that yield visually appealing and semantically appropriate background scenes. Prior work has made great efforts to create prompts through expert-crafted rules or specialized fine-tuning of large language models, but it still relies on detailed human input and often falls short of generating results that meet e-commerce standards.
In this paper, we propose Product2Img, a novel prompt-free diffusion model with an automatic training-data refinement strategy for product background generation. Product2Img employs Contrastive Background Alignment (CBA) for the text encoder, enhancing its ability to perceive relevant backgrounds during the diffusion generation process without the need for specific background prompts. Meanwhile, we develop Iterative Data Refinement with a Self-improved LMM (IDR-LMM), a framework that iteratively enhances the LMM's data selection capability for diffusion model training, thereby yielding continuous performance improvements.
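The abstract does not spell out the CBA objective, but contrastive alignment between text and background representations is commonly realized with an InfoNCE-style loss, where paired text/background embeddings form positives and other batch members form negatives. The sketch below is a generic, illustrative implementation of such a loss (all names and the exact formulation are assumptions, not the paper's method):

```python
import numpy as np

def info_nce_loss(text_emb, bg_emb, temperature=0.07):
    """Generic InfoNCE-style contrastive loss: each text embedding is
    pulled toward its paired background embedding (the diagonal of the
    similarity matrix) and pushed away from the other backgrounds in
    the batch. Illustrative only; the paper's CBA objective may differ."""
    # L2-normalize both sets of embeddings so the dot product is cosine similarity
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    b = bg_emb / np.linalg.norm(bg_emb, axis=1, keepdims=True)
    logits = (t @ b.T) / temperature                      # (N, N) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                   # cross-entropy on the diagonal

# Toy check with orthogonal embeddings: matched pairs give a near-zero loss,
# while deliberately shifted (wrong) pairs give a large one.
emb = np.eye(4, 8)
loss_aligned = info_nce_loss(emb, emb)
loss_mismatched = info_nce_loss(emb, np.roll(emb, 1, axis=0))
assert loss_aligned < loss_mismatched
```

Minimizing such a loss encourages the text encoder to place a product description close to embeddings of backgrounds that suit it, which is one plausible way a prompt-free model could acquire background awareness.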
Furthermore, we establish an E-commerce Product Background Dataset (EPBD) to support this research and future work.
Experimental results indicate that our approach significantly outperforms current prevalent methods in terms of automatic metrics and human evaluation, yielding improved background aesthetics and relevance.
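The IDR-LMM framework described above is an iterative loop in which an LMM-based selector filters training data, the diffusion model is retrained on the kept samples, and feedback improves the selector for the next round. A minimal, purely hypothetical sketch of such a loop (sample structure, thresholding scheme, and all function names are assumptions; a real system would query a large multimodal model and actually retrain the diffusion model):

```python
def lmm_keep(sample, threshold):
    """Stand-in for an LMM's accept/reject judgment on a training pair.
    Here a sample is just a dict with a numeric 'quality' field; a real
    system would prompt a large multimodal model for this decision."""
    return sample["quality"] >= threshold

def idr_loop(pool, rounds=3, threshold=0.3, step=0.1):
    """Hypothetical IDR-LMM-style loop: filter the pool with the current
    selector, (re)train the diffusion model on the kept data, then use
    feedback to tighten the selector for the next round."""
    for _ in range(rounds):
        kept = [s for s in pool if lmm_keep(s, threshold)]
        # train_diffusion(kept)  <- placeholder for the actual training step
        threshold += step  # feedback raises the bar, mimicking self-improvement
        pool = kept
    return pool

# Toy run: a pool of samples with quality scores 0.0 .. 0.9.
pool = [{"quality": q / 10} for q in range(10)]
refined = idr_loop(pool)
assert all(s["quality"] >= 0.5 for s in refined)
assert len(refined) < len(pool)
```

The essential property is that each round both shrinks the pool toward higher-quality pairs and updates the selection criterion, which is how iterative refinement can yield continuous improvement rather than a single fixed filtering pass.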
Primary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: Product2Img directly contributes to cutting-edge research in multimodal processing by introducing an automated approach for generating product-specific backgrounds in e-commerce settings. Distinct from existing diffusion models, it streamlines visual content creation, a crucial aspect of multimodal media production, by removing the reliance on detailed background prompts, thereby enhancing efficiency and accessibility. The LMM that self-improves through natural language feedback further aligns with the conference themes of innovation and iterative advancement in multimodal technologies. As a result, Product2Img not only showcases a practical application in e-commerce but also represents a significant methodological innovation with implications for future multimodal research and development. The introduction of the EPBD dataset is an additional contribution that underscores the practical relevance and potential for ongoing exploration in the field, making this work highly pertinent to discussions and advancements in multimedia and multimodal processing at the conference.
Supplementary Material: zip
Submission Number: 1780