Progressively Label Enhancement for Large Language Model Alignment

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Large Language Model (LLM) alignment aims to prevent models from producing content that misaligns with human expectations, which can lead to ethical and legal concerns. In the last few years, Reinforcement Learning from Human Feedback (RLHF) has been the most prominent method for achieving alignment. However, RLHF suffers from stability and scalability challenges that arise from the complex interactions between multiple models, so researchers are exploring alternative methods that achieve comparable effects. These alternatives, in turn, often rely on large, high-quality datasets. Although some methods generate additional data to expand their datasets, they typically treat model training and data generation as separate, static processes, overlooking the fact that the two are highly interdependent; this leads to inefficient utilization of the generated data. To address this problem, we propose PLE, i.e., Progressively Label Enhancement for LLM Alignment, a framework that dynamically adjusts the model's training process based on the evolving quality of the generated data. Specifically, we prompt the model to generate responses to both the original query and a set of carefully designed principle-guided queries, and then use a dynamic threshold to determine the appropriate training approach for each response based on its reward score. Experimental results demonstrate the effectiveness of PLE compared to existing LLM alignment methods.
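Below is a minimal sketch of the data-selection step described in the abstract: responses to the original query and to a principle-guided query are scored by a reward model, and a progressively rising threshold decides how each response is used for training. All names here (the policy sampler, reward model, principle strings, and threshold schedule) are illustrative assumptions, not the paper's released API; see the repository link below for the actual implementation.

```python
import random

# Hypothetical principles prepended to the original query to form
# principle-guided queries (placeholders, not taken from the paper).
PRINCIPLES = ["Be helpful and harmless.", "Refuse unsafe requests politely."]

def generate(prompt: str) -> str:
    # Placeholder for sampling a response from the policy model.
    return f"response to: {prompt}"

def reward_model(query: str, response: str) -> float:
    # Placeholder reward score in [0, 1].
    return random.random()

def dynamic_threshold(step: int, total_steps: int, lo: float = 0.3, hi: float = 0.8) -> float:
    # Assumed linear schedule: the threshold rises over training so that only
    # increasingly good responses are treated as positive supervision.
    return lo + (hi - lo) * step / max(total_steps - 1, 1)

def build_training_batch(queries, step, total_steps):
    tau = dynamic_threshold(step, total_steps)
    batch = []
    for q in queries:
        guided_q = random.choice(PRINCIPLES) + " " + q  # principle-guided query
        for prompt in (q, guided_q):
            resp = generate(prompt)
            score = reward_model(q, resp)
            # Above the threshold: use as a positive (SFT-style) example;
            # below: keep only as a negative / contrastive signal.
            role = "positive" if score >= tau else "negative"
            batch.append({"query": q, "response": resp, "score": score, "role": role})
    return batch

if __name__ == "__main__":
    for step in range(3):
        batch = build_training_batch(["How do I stay safe online?"], step, total_steps=3)
        print(step, [(b["role"], round(b["score"], 2)) for b in batch])
```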
Lay Summary: As AI becomes more powerful, it’s crucial to ensure that large language models (LLMs) behave in ways that align with human values and expectations. Traditionally, researchers have used a technique called Reinforcement Learning from Human Feedback (RLHF) to train these models to give helpful and safe responses. However, RLHF can be difficult to scale and improve efficiently. Our paper introduces a new method, called Progressively Label Enhancement, to make this training process smarter and more effective. Instead of treating training and data generation as two separate steps, our approach links them together. It lets the model improve by learning not only from standard examples but also from specially crafted, principle-guided questions. The model then decides—based on how well it performs—what kind of training it needs to get better. This dynamic, flexible method helps the model learn faster and more safely. Our experiments show that it outperforms existing approaches in aligning LLMs with human goals.
Link To Code: https://github.com/palm-biaoliu/PLE
Primary Area: Deep Learning->Large Language Models
Keywords: Language Model Alignment
Submission Number: 11080