Rethinking the Evaluation of Alignment Methods: Insights into Diversity, Generalisation, and Safety

ACL ARR 2025 February Submission 3618 Authors

15 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Large language models (LLMs) require careful alignment to balance generalisation, diversity, and safety. Existing studies typically examine individual techniques or single dimensions in isolation, offering no holistic assessment of the trade-offs involved. We propose a framework that evaluates common alignment methods (PPO, DPO, ORPO, KTO) across five key dimensions, using both in-distribution and out-of-distribution datasets. Our findings provide insights into the trade-offs of these methods, guiding the development of more balanced and reliable LLMs.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: alignment, generalisation, diversity, LLM, safety, evaluation
Contribution Types: Model analysis & interpretability, Reproduction study
Languages Studied: English
Submission Number: 3618