Uncovering Factor-Level Preference to Improve Human-Model Alignment

ACL ARR 2025 May Submission 8044 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Large Language Models (LLMs) often exhibit tendencies that diverge from human preferences, such as favoring certain writing styles or producing verbose outputs. Identifying the factors driving these misalignments is crucial for improvement, yet it remains challenging because existing evaluation methods rely on coarse-grained comparisons and lack explainability. To address this, we introduce PROFILE, an automated framework that uncovers and measures the alignment of factor-level preferences between humans and LLMs. Using PROFILE, we analyze preference alignment across summarization, instruction-following, and document-based question-answering tasks. We find a significant discrepancy: while LLMs show poor factor-level alignment with human preferences when generating text, they demonstrate strong alignment in evaluation tasks. We then show how this generation-evaluation gap can be leveraged to improve LLM alignment through multiple approaches, including fine-tuning with self-guidance.
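The abstract does not specify how PROFILE computes factor-level alignment. As a rough, hypothetical illustration of the general idea only (not the paper's method), the sketch below scores per-factor agreement between human and model pairwise preferences and compares overall factor weightings with a rank correlation; all factor names, data, and scores are invented.

```python
# Hypothetical sketch (NOT the PROFILE implementation): factor-level preference
# alignment measured as per-factor agreement between human and model pairwise
# judgments, plus a rank correlation over assumed factor-importance weights.
from collections import defaultdict
from scipy.stats import spearmanr

# Toy data: for each pairwise comparison, which response ("A" or "B") is
# preferred along each (assumed) factor, by humans and by the LLM.
human_judgments = [
    {"conciseness": "A", "faithfulness": "A", "style": "B"},
    {"conciseness": "B", "faithfulness": "A", "style": "B"},
]
model_judgments = [
    {"conciseness": "B", "faithfulness": "A", "style": "B"},
    {"conciseness": "B", "faithfulness": "A", "style": "A"},
]

def factor_agreement(human, model):
    """Fraction of comparisons where human and model agree, per factor."""
    counts = defaultdict(lambda: [0, 0])  # factor -> [agreements, total]
    for h, m in zip(human, model):
        for factor in h:
            counts[factor][1] += 1
            counts[factor][0] += int(h[factor] == m[factor])
    return {f: agree / total for f, (agree, total) in counts.items()}

print(factor_agreement(human_judgments, model_judgments))
# e.g. {'conciseness': 0.5, 'faithfulness': 1.0, 'style': 0.5}

# Optional: compare how strongly each side weights the factors overall
# (toy importance scores) with a Spearman rank correlation.
human_weights = [0.5, 0.3, 0.2]
model_weights = [0.2, 0.4, 0.4]
rho, _ = spearmanr(human_weights, model_weights)
print(f"Factor-weight rank correlation: {rho:.2f}")
```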
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: human alignment, Large Language Model, explainability, generation, evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 8044