Leveraging Knowledge Distillation to Mitigate Model Collapse

Ilya Statsenko; Nikita Andriyanov; Oleg Shishkin

27 Sept 2024 (modified: 25 Nov 2024), ICLR 2025 Conference Withdrawn Submission, CC BY 4.0

Keywords: computer vision, natural language processing, generative models, diffusion, vae, text summarization, model collapse, synthetic data, distillation

TL;DR: Knowledge distillation from a model trained on real data to a model trained on synthetic data improves the performance of the latter.

Abstract: As the amount of neural-network-generated data on the Internet grows rapidly, driven by widespread access to generative models, it is natural to ask how this surge in synthetic data affects subsequent models that are trained on it. Previous work has demonstrated a concerning trend: models trained predominantly on synthetic data often experience a decline in performance, which can escalate to a complete loss of the ability to reproduce the initial distribution of real-world data. This phenomenon, now referred to as model collapse, highlights the potential pitfalls of over-reliance on synthetic datasets, which may lack the diversity and complexity inherent in genuine data. To address this issue, we propose a novel method that leverages the well-established technique of knowledge distillation. Our approach aims to mitigate the adverse effects of synthetic data by facilitating a more effective transfer of knowledge from high-performing teacher models to a student model. By doing so, we seek to enhance not only the qualitative aspects, such as the richness and variability of the generated outputs, but also the quantitative metrics that gauge model performance. Through extensive experimentation, we demonstrate that our method improves the robustness and generalization capabilities of models trained on synthetic data; for instance, for DDPM the improvement is 68.8% in terms of the FID metric. This contributes to a more sustainable and effective use of synthetic datasets in machine learning applications.
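
The abstract describes transferring knowledge from a teacher model to a student model trained on synthetic data but does not give the objective. As a rough illustration only, the sketch below shows one generic form of output-matching distillation: the student's usual loss on synthetic targets is blended with a term that pulls its outputs toward the teacher's. The function names, the use of mean squared error, and the mixing weight `alpha` are all assumptions made for this example, not details taken from the submission.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length, non-empty vectors."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    student_out: Sequence[float],
    teacher_out: Sequence[float],
    synthetic_target: Sequence[float],
    alpha: float = 0.5,  # hypothetical mixing weight, not from the paper
) -> float:
    """Blend the task loss on synthetic data with a teacher-matching term.

    alpha = 0 ignores the teacher; alpha = 1 trains on the teacher alone.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    task_term = mse(student_out, synthetic_target)   # fit the synthetic data
    distill_term = mse(student_out, teacher_out)     # stay close to the teacher
    return (1.0 - alpha) * task_term + alpha * distill_term


if __name__ == "__main__":
    # Toy vectors standing in for model outputs (illustrative values only).
    print(distillation_loss([0.0, 0.0], [1.0, 1.0], [2.0, 2.0], alpha=0.5))
```

With the toy vectors above, the task term is 4.0 and the teacher term is 1.0, so the blended loss is 2.5. In a real setup the same weighting would be applied to the outputs of trained networks inside a gradient-based training loop.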

Primary Area: generative models

Submission Number: 11091
