Rethinking the Evaluation of Alignment Methods: Insights into Diversity, Generalisation, and Safety

ACL ARR 2025 February Submission 3618 Authors

15 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Large language models (LLMs) require careful alignment to balance generalisation, diversity, and safety. Existing studies typically examine individual techniques or single dimensions in isolation, offering no holistic assessment of the trade-offs involved. We propose a framework that evaluates common alignment methods (PPO, DPO, ORPO, KTO) across five key dimensions, using both in-distribution and out-of-distribution datasets. Our findings provide insights into the trade-offs of these methods, guiding the development of more balanced and reliable LLMs.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: alignment, generalisation, diversity, LLM, safety, evaluation
Contribution Types: Model analysis & interpretability, Reproduction study
Languages Studied: English
Submission Number: 3618