Keywords: Sycophancy, benchmarking, evaluation
Abstract: Large Language Models (LLMs) often display sycophancy—a tendency to agree with or flatter users regardless of factual accuracy. While overt sycophancy is frequently exaggerated and thus noticeable, more subtle forms such as hedging, biased phrasing, or polished formatting can be far harder to detect. These behaviors are concerning because they may silently undermine user trust and distort decision-making, yet existing benchmarks treat sycophancy as a single phenomenon and overlook such nuance. In this work, we introduce SycophancyBench, the first benchmark explicitly designed to disentangle overt from subtle sycophancy. Our dataset spans multiple domains including factual QA, opinions, decision-making, and safety, with paired responses capturing factual, overtly sycophantic, and subtly sycophantic behaviors under varied stylistic conditions. We provide standardized evaluation dimensions—faithfulness, sensitivity to sycophancy, trust calibration, and style robustness—enabling systematic analysis of detection thresholds where humans and evaluation models fail to notice subtle sycophancy. Beyond measurement, we propose a dual-objective reward framework that encourages truthfulness and politeness while penalizing sycophantic tendencies. Together, our contributions establish a principled foundation for understanding how nuanced sycophancy affects trust and for developing models that remain both polite and genuinely faithful.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation, NLP datasets
Contribution Types: Data resources
Languages Studied: English
Submission Number: 6330