Joint Learning Between Reference Image and Text Prompt for Fashion Image Editing

ICLR 2026 Conference Submission 25354 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Fashion Image Editing, Diffusion model, Text-Guided Image Editing
Abstract: Fashion image editing is an essential tool for designers to visualize design concepts, aiming to modify the garment in an input fashion image while leaving the other regions of the image unaffected. Existing methods focus primarily on image-based virtual try-on or text-driven fashion image editing and often rely on multiple sources of auxiliary information, including segmentation masks or dense poses; as a result, they struggle with error accumulation or high computational costs when performing try-on and editing simultaneously. In this work, we introduce D$^2$-Edit, a joint learning framework for fashion image editing based on text prompts and reference images. It targets flexible, fine-grained editing, including garment migration and attribute adjustments such as sleeve length, texture, color, and material via textual descriptions. The proposed D$^2$-Edit consists of four key components: (i) an \textbf{image degradation module}, which introduces controlled noise to facilitate learning of the target garment concept while preserving the contextual relationships between the target concept and other elements; (ii) an \textbf{image reconstruction module}, responsible for reconstructing both the fashion image and the reference image; (iii) a \textbf{garment concept learning module}, which encourages each text token (e.g., \textit{skirt}) to attend solely to the image regions corresponding to the target concept via a cross-attention loss; and (iv) a \textbf{concept editing direction identification module}, designed to enable flexible attribute adjustments such as fabric, color, and sleeve length. Extensive comparisons, ablations, and analyses demonstrate the effectiveness of our method across a wide range of test cases, highlighting its superiority over existing alternatives.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 25354