Keywords: LLMs, preference optimization, self-judgement, LLM-as-a-judge, supervised fine-tuning, multi-objective optimization, multi-task learning, supervised learning
Abstract: The alignment of Large Language Models (LLMs) is typically achieved by fine-tuning on human preference data, where preference optimization has become a critical part of the process. Many methods have improved LLM performance by incorporating self-judgement, highlighting the importance of unifying LLM-as-a-judge with the alignment process. One such method, Self-Rewarding LLMs, iteratively samples new data from the model and uses self-judgement to improve alignment. Since this additional data is generated by the LLM itself, we argue that similar improvements can be achieved without any new data. We propose a method that reuses the alignment data as a self-judgement classification task, yielding a multi-objective optimization problem. The self-judgement task is derived from a simple transformation of the primary alignment data: the LLM is asked to select the superior of the two responses. It therefore introduces no data beyond the existing alignment data, and we attribute the improvements to positive interference between the two tasks. We focus on a direct preference optimization method called Odds-Ratio Preference Optimization (ORPO). We conduct a thorough study of linear scalarization over the two objectives and introduce two alternative approaches that vary the emphasis on the alignment versus the self-judgement objective. Our results on Mistral 7B indicate a promising direction for fine-tuning LLMs on multiple objectives, particularly for improving performance on related tasks without additional natural-language data.
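To make the abstract's two ingredients concrete, here is a minimal Python sketch of (1) recasting an existing preference pair as a self-judgement classification example and (2) combining the alignment and self-judgement losses via linear scalarization. All names (e.g., `to_self_judgement_example`, the judge prompt template, the weight `w`) are illustrative assumptions, not the paper's exact formulation.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One alignment example: a prompt with a preferred and a rejected response."""
    prompt: str
    chosen: str
    rejected: str


# Hypothetical judge prompt; the paper's actual template may differ.
JUDGE_TEMPLATE = (
    "Given the instruction and two candidate responses, answer 'A' or 'B' "
    "to indicate the superior response.\n\n"
    "Instruction: {prompt}\n\nResponse A: {a}\n\nResponse B: {b}\n\nAnswer:"
)


def to_self_judgement_example(pair: PreferencePair, chosen_first: bool = True):
    """Derive a self-judgement classification example from an alignment pair.

    No new natural-language data is introduced: the prompt and both responses
    come directly from the existing preference data; only the framing changes.
    """
    a, b = (pair.chosen, pair.rejected) if chosen_first else (pair.rejected, pair.chosen)
    label = "A" if chosen_first else "B"
    judge_prompt = JUDGE_TEMPLATE.format(prompt=pair.prompt, a=a, b=b)
    return judge_prompt, label


def scalarized_loss(alignment_loss: float, judge_loss: float, w: float = 0.5) -> float:
    """Linear scalarization of the two objectives; w sets the emphasis on self-judgement."""
    return (1.0 - w) * alignment_loss + w * judge_loss
```

In this sketch, `alignment_loss` would be the ORPO loss on the preference pair and `judge_loss` a standard classification loss on the derived self-judgement example; sweeping `w` corresponds to varying the emphasis between the two objectives.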
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5498