Abstract: The rise of generative models has transformed image generation and editing, enabling high-quality, user-guided outputs. Iterative face editing, essential for applications such as virtual makeup and entertainment, allows users to refine images progressively. However, this process often leads to artifact accumulation, semantic inconsistency, and quality degradation over multiple edits. Existing methods, while effective for single-step modifications, struggle with sequential edits. To maintain fidelity and consistency in iterative face editing across multiple sessions, we propose IterDiff, a training-free framework built on diffusion models. It introduces a novel Training-Free Feature Preservation (TF2P) approach that tackles these challenges by storing and retrieving key-value (KV) pairs from self-attention layers. We further improve its efficiency and feasibility with an Efficient CLIP-guided Memory Bank (ECMB). Experiments on the proposed benchmark show that IterDiff excels in prompt alignment, content consistency, and image quality, providing a robust solution for iterative facial attribute editing. Code, dataset, and supplementary materials are available at https://github.com/david20571015/IterDiff.
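To make the idea of storing and retrieving self-attention KV pairs concrete, the following is a minimal sketch and not the IterDiff implementation. It assumes one common way of reusing stored features: keys and values saved from an earlier edit are concatenated with the current ones before scaled dot-product attention. The names `KVMemoryBank`, `attention_with_memory`, and the layer label `"layer0"` are hypothetical, and the CLIP-guided selection in ECMB is not modeled here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of float vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


class KVMemoryBank:
    """Hypothetical store of self-attention KV pairs, indexed by layer name."""

    def __init__(self):
        self._bank = {}

    def store(self, layer, keys, values):
        # Copy the vectors so later edits cannot mutate the stored features.
        self._bank[layer] = ([list(k) for k in keys], [list(v) for v in values])

    def retrieve(self, layer):
        # Returns (keys, values), or None if nothing was stored for this layer.
        return self._bank.get(layer)


def attention_with_memory(bank, layer, queries, keys, values):
    """Attend over the current KV pairs plus any stored for this layer.

    Assumption: stored pairs are simply prepended to the current ones.
    """
    stored = bank.retrieve(layer)
    if stored is not None:
        keys = stored[0] + keys
        values = stored[1] + values
    return attention(queries, keys, values)
```

In this sketch, a first editing pass would call `bank.store("layer0", keys, values)`, and later passes would call `attention_with_memory(bank, "layer0", ...)` so that features from the earlier pass still contribute to the output.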