Keywords: Self-Alignment, Annotation-free, Large Language Model
Abstract: Traditional reinforcement learning from human feedback (RLHF) relies heavily on costly and time-consuming human-annotated datasets. Even Reinforcement Learning from AI Feedback (RLAIF), which trains a reward model on AI-generated preference data before refining the language model through reinforcement learning, remains expensive. These methods often require either specialized reward model designs or larger models (e.g., GPT-4) for external labeling. In this paper, we introduce a dataset-free and annotation-free framework called Self-Alignment Optimization (SAO), which addresses these issues by aligning the model using prompts it generates itself and its own feedback as preference signals. SAO starts from a chat-based model that engages in persona role-play to generate diverse prompts and responses, which are then self-evaluated and used for preference optimization.
Extensive experiments with two strong LLMs on several benchmarks demonstrate the effectiveness of SAO. Specifically, on AlpacaEval 2.0, Gemma-2-9B-it-SAO achieves a length-controlled win rate (LC) of 69.2\% and a win rate (WR) of 66.0\%, surpassing the baseline model by 18.1\% and 27.9\%, respectively. Llama-3-8B-Instruct-SAO reaches 33.3\% LC and 39.0\% WR, improvements of 10.4\% and 16.4\%, respectively. On MT-Bench, Gemma-2-9B-it-SAO and Llama-3-8B-Instruct-SAO score 7.41 and 6.76, compared to their pre-SAO scores of 7.09 and 6.70. On Arena-Hard, SAO yields even larger gains: Gemma-2-9B-it's WR increases from 52.6\% to 70.1\% and Llama-3-8B-Instruct's WR rises from 40.3\% to 56.4\%. In addition, further experiments show that models fine-tuned with SAO match or even exceed the baseline models on downstream NLP tasks, unlike models trained on externally labeled datasets, which improve alignment ability but may compromise some general capabilities. We anticipate that this work will provide new insights for future research on self-improvement in LLMs.
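The abstract outlines a three-step loop: the model role-plays personas to write its own prompts, samples candidate responses, self-evaluates them, and keeps best/worst pairs for preference optimization. The following minimal Python sketch illustrates that data-generation loop only. Every name in it (`chat`, `self_score`, the example personas, and the judging prompt) is an assumption made for illustration, not the authors' actual prompts or implementation, and the resulting (prompt, chosen, rejected) triples would be passed to an off-the-shelf preference-optimization method such as DPO.

```python
# Illustrative sketch of the SAO self-generated preference loop (assumed details).
import random
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def chat(system: str, user: str) -> str:
    """Placeholder for a call to the chat model being aligned."""
    return f"[model reply to: {user[:40]}...]"


def self_score(prompt: str, response: str) -> float:
    """Placeholder self-evaluation: the same model rates its own response."""
    rating = chat(
        system="You are a strict judge. Rate the response from 1 to 10.",
        user=f"Prompt: {prompt}\nResponse: {response}\nScore:",
    )
    # A real implementation would parse the numeric score from `rating`;
    # we return a random value so the sketch runs end to end.
    return random.uniform(1.0, 10.0)


# Example personas; the paper's actual persona pool is not specified here.
PERSONAS = ["a curious student", "a software engineer", "a travel blogger"]


def build_preference_data(num_prompts: int = 3, samples_per_prompt: int = 2):
    pairs = []
    for _ in range(num_prompts):
        persona = random.choice(PERSONAS)
        # 1) Persona role-play: the model writes a prompt as if it were the user.
        prompt = chat(
            system=f"Role-play as {persona} and ask one question you care about.",
            user="Write the question only.",
        )
        # 2) Sample several candidate responses to the self-generated prompt.
        #    (With a real model these would differ; the placeholder makes them identical.)
        responses = [
            chat(system="You are a helpful assistant.", user=prompt)
            for _ in range(samples_per_prompt)
        ]
        # 3) Self-evaluate and keep the best/worst responses as a preference pair.
        scored = sorted(responses, key=lambda r: self_score(prompt, r))
        pairs.append(PreferencePair(prompt=prompt, chosen=scored[-1], rejected=scored[0]))
    return pairs


if __name__ == "__main__":
    for pair in build_preference_data():
        print(pair.prompt, "->", pair.chosen[:30], "vs", pair.rejected[:30])
```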
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5684