Navigating Alignment Pitfalls: Assessing Suggestions to Combat Sycophancy

Navigating Alignment Pitfalls: Assessing Suggestions to Combat Sycophancy

ACL ARR 2024 June Submission4983 Authors

16 Jun 2024 (modified: 02 Jul 2024)ACL ARR 2024 June SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Abstract: Sycophancy causes models to produce answers that cater to user expectations rather than providing truthful responses. Previous research has found that model scaling, instruction tuning, and human feedback may increase sycophancy. However, these studies primarily focused on closed-source models and used indirect analysis to demonstrate the influence of human feedback. Our study focuses on sycophancy in open-source models, which are commonly used for specialized domain applications. We investigated the impact of human feedback on sycophancy by directly comparing models aligned with human feedback to those not aligned. To address sycophancy, we proposed assessing the user's expected answer rather than ignoring it. Consequently, we developed the Assessing Suggested Answer Preferences (ASAP) dataset and demonstrated that ASAP can enhance the model's assessment ability and reduce sycophancy across tasks.

Paper Type: Short

Research Area: Ethics, Bias, and Fairness

Research Area Keywords: sycophancy,alignment,truthfulness

Contribution Types: Model analysis & interpretability, Data resources

Languages Studied: English

Submission Number: 4983

Loading