TLCM: Training-efficient Latent Consistency Model for Image Generation with 2-8 Steps

Qingsong Xie, Zhenyi Liao, Chen Chen, Zhijie Deng, Shixiang Tang, Haonan Lu

27 Sept 2024 (modified: 15 Nov 2024), ICLR 2025 Conference Withdrawn Submission, CC BY 4.0

Keywords: latent diffusion model, consistency model, acceleration

TL;DR: We propose a novel Training-efficient Latent Consistency Model (TLCM) to tackle the high training cost and the performance drop at few-step sampling in large distilled latent diffusion models.

Abstract: Distilling latent diffusion models (LDMs) into models that are fast to sample from is attracting growing research interest. However, most existing methods face two critical challenges: 1) they require long training on a huge volume of real data, and 2) they routinely degrade generation quality, especially text-image alignment. This paper proposes the Training-efficient Latent Consistency Model (TLCM) to overcome these challenges. Our method first rapidly accelerates LDMs via data-free multistep latent consistency distillation (MLCD); data-free latent consistency distillation is then applied to guarantee inter-segment consistency in MLCD at low cost. Furthermore, we introduce a bag of techniques to enhance TLCM's performance at few-step inference without any real data, e.g., distribution matching, adversarial learning, and preference learning. TLCM is flexible: the number of sampling steps can be adjusted within the range of 2 to 8 while still producing outputs competitive with full-step approaches. As its name suggests, TLCM excels in training efficiency in terms of both computational resources and data utilization. Notably, TLCM does not rely on a training dataset; instead, it uses synthetic data from the teacher itself during distillation. With just 70 training hours on an A100 GPU, a 3-step TLCM distilled from SDXL achieves a CLIP Score of 33.68 and an Aesthetic Score of 5.97 on the MSCOCO-2017 5K benchmark, surpassing various accelerated models and even outperforming the teacher model in human preference metrics. We also demonstrate the versatility of TLCMs in applications including controllable generation, image style transfer, and Chinese-to-image generation.

Primary Area: generative models

Submission Number: 10466
