Navigating Alignment Pitfalls: Assessing Suggestions to Combat Sycophancy

ACL ARR 2024 June Submission4983 Authors

16 Jun 2024 (modified: 02 Jul 2024)ACL ARR 2024 June SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Abstract: Sycophancy causes models to produce answers that cater to user expectations rather than providing truthful responses. Previous research has found that model scaling, instruction tuning, and human feedback may increase sycophancy. However, these studies primarily focused on closed-source models and used indirect analysis to demonstrate the influence of human feedback. Our study focuses on sycophancy in open-source models, which are commonly used for specialized domain applications. We investigated the impact of human feedback on sycophancy by directly comparing models aligned with human feedback to those not aligned. To address sycophancy, we proposed assessing the user's expected answer rather than ignoring it. Consequently, we developed the Assessing Suggested Answer Preferences (ASAP) dataset and demonstrated that ASAP can enhance the model's assessment ability and reduce sycophancy across tasks.
Paper Type: Short
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: sycophancy,alignment,truthfulness
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 4983
Loading