Abstract: Large language models (LLMs) require careful alignment to balance generalization, diversity, and safety. Existing studies focus on individual techniques or specific dimensions, lacking a holistic assessment of trade-offs. We propose a framework evaluating common alignment methods (PPO, DPO, ORPO, KTO) across five key dimensions using in-distribution and out-of-distribution datasets. Our findings provide insights into their trade-offs, guiding the development of more balanced and reliable LLMs.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: alignment, generalisation, diversity, LLM, safety, evaluation
Contribution Types: Model analysis & interpretability, Reproduction study
Languages Studied: English
Submission Number: 3618