Taming Diffusion for Fashion Clothing Generation with Versatile Condition

Published: 01 Jan 2024 · Last Modified: 19 Feb 2025 · PRCV (5) 2024 · CC BY-SA 4.0
Abstract: The intersection of art and artificial intelligence (AI) is rapidly evolving, particularly within the realm of fashion design, where AI's potential to augment human creativity is increasingly recognized. Despite the progress achieved through diffusion models in AI-assisted fashion design, the fine-grained detail of fashion items remains a significant hurdle to producing high-quality outputs. To address this, we propose S2CDiff, a diffusion-based framework designed for fashion Clothing generation with versatile Style conditions, such as text, textures, sketches, masks, and colors. S2CDiff targets versatile image synthesis and includes an efficient, lightweight conditional adapter that guides a pre-trained text-to-image diffusion model. Using a frozen CLIP encoder, we extract texture patch features and preserve clothing texture characteristics by injecting them into the diffusion model's denoising process via cross-modal attention with text embeddings. Furthermore, we curate a dataset of 13K fashion items with corresponding conditions to assess the efficiency, quality, and editability of S2CDiff, showcasing its strong performance. Qualitative and quantitative experimental results corroborate the effectiveness of our approach and underscore the model's flexibility in controlling generated images through sketches, textures, and colors.
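To make the texture-conditioning idea concrete, the following is a minimal sketch of how frozen CLIP patch features could be fused with text embeddings as a cross-attention context for a diffusion UNet block. The module name, projection layers, dimensions, and fusion strategy here are assumptions for illustration, not the authors' S2CDiff implementation.

```python
# Hypothetical sketch: frozen CLIP vision encoder supplies texture patch tokens,
# which are projected and concatenated with CLIP text embeddings; UNet latent
# features then attend to the combined sequence via cross-attention.
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPTokenizer, CLIPTextModel, CLIPImageProcessor, CLIPVisionModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen CLIP encoders (weights are never updated).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
for p in list(text_encoder.parameters()) + list(vision_encoder.parameters()):
    p.requires_grad_(False)

class TextureTextCrossAttention(nn.Module):
    """Assumed adapter: fuses texture patch tokens with text tokens and lets
    UNet latent features attend to the combined sequence."""
    def __init__(self, latent_dim=320, text_dim=512, vision_dim=768, ctx_dim=768, heads=8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, ctx_dim)       # assumed projection to a shared dim
        self.texture_proj = nn.Linear(vision_dim, ctx_dim)  # assumed projection to a shared dim
        self.query_proj = nn.Linear(latent_dim, ctx_dim)
        self.attn = nn.MultiheadAttention(embed_dim=ctx_dim, num_heads=heads, batch_first=True)
        self.out_proj = nn.Linear(ctx_dim, latent_dim)

    def forward(self, latents, text_tokens, texture_tokens):
        # latents: (B, N, latent_dim) flattened spatial features from a UNet block.
        context = torch.cat([self.text_proj(text_tokens),
                             self.texture_proj(texture_tokens)], dim=1)
        q = self.query_proj(latents)
        fused, _ = self.attn(q, context, context)
        return latents + self.out_proj(fused)  # residual injection into the denoising path

# Example conditioning inputs (placeholder texture image instead of a real swatch).
prompt = ["a denim jacket with floral embroidery"]
texture_image = Image.new("RGB", (224, 224), color=(120, 90, 60))

with torch.no_grad():
    text_inputs = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").to(device)
    text_tokens = text_encoder(**text_inputs).last_hidden_state                   # (1, 77, 512)
    pixel_values = image_processor(images=texture_image, return_tensors="pt").pixel_values.to(device)
    texture_tokens = vision_encoder(pixel_values=pixel_values).last_hidden_state  # (1, 50, 768)

adapter = TextureTextCrossAttention().to(device)
dummy_latents = torch.randn(1, 64 * 64, 320, device=device)  # stand-in for UNet features
out = adapter(dummy_latents, text_tokens, texture_tokens)
print(out.shape)  # torch.Size([1, 4096, 320])
```

In this sketch, keeping both CLIP encoders frozen and training only the small adapter mirrors the lightweight-adapter design described in the abstract; the actual conditioning interfaces for sketches, masks, and colors are not shown here.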