The traditional process of creating labeled datasets is not only labor-intensive but also expensive. Recent breakthroughs in open-source large language models (LLMs), such as Llama-3, have opened a new avenue for automatically generating labeled datasets for various natural language processing (NLP) tasks, offering an alternative to this costly annotation process. However, the reliability of such auto-generated labels remains a significant concern due to inherent inaccuracies. When learning from such noisy labels, a model's generalization is likely to suffer, as the model is prone to overfitting the label noise. In this paper, we propose the \textbf{Si}mplex Diffusion with a \textbf{Dy}namic \textbf{P}rior (\textbf{SiDyP}) model to calibrate the classifier's predictions, thus enhancing its robustness to noisy labels. Our framework leverages a simplex diffusion model to iteratively correct noisy labels, conditioned on training dynamics trajectories obtained during classifier fine-tuning. The \textbf{P}rior in SiDyP refers to the set of potential true-label candidates, which is obtained from the neighborhood label distribution in text embedding space. It is \textbf{Dy}namic because we progressively distill these candidates based on feedback from the diffusion model. SiDyP improves the performance of a BERT classifier fine-tuned on both zero-shot and few-shot Llama-3-generated noisy-label datasets by an average of 5.33% and 7.69%, respectively. Our extensive experiments, which cover different LLMs, diverse noise types (real-world and synthetic), ablation studies, and multiple baselines, demonstrate the effectiveness of SiDyP across a range of NLP tasks. We will make the code and data publicly available on GitHub (under a CC BY 4.0 license) upon publication of the work.
Keywords: Diffusion Model, Learning from Noisy Labels, Soft Labels
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8908