Uncovering Factor-Level Preference to Improve Human-Model Alignment

ACL ARR 2025 February Submission 7771 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Large Language Models (LLMs) often exhibit tendencies that diverge from human preferences, such as favoring certain writing styles or producing verbose outputs. While crucial for improvement, identifying the factors driving these misalignments remains challenging because existing evaluation methods rely on coarse-grained comparisons and lack explainability. To address this, we introduce PROFILE (PRObing Factors of InfLuence for Explainability), a novel framework that uncovers and quantifies the influence of specific factors driving both human and model preferences. Using PROFILE, we analyze preferences across summarization, instruction-following, and document-based question-answering tasks, revealing a surprising discrepancy: while LLMs show poor alignment with human preferences in generation tasks, they demonstrate strong alignment in evaluation tasks. We demonstrate how factor-level insights and the identified generation-evaluation gap can be leveraged to improve LLM alignment through multiple approaches, including fine-tuning with self-guidance. Our findings provide practical approaches for improving LLM alignment while opening new directions for research on factor-level analysis.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: human alignment, Large Language Model, explainability, generation, evaluation
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 7771