Keywords: benchmark, sycophancy, llms, dataset, ai safety, prompt manipulation, social cues, multi-turn dialogue, LLM-as-a-judge
TL;DR: GaslightBench: a plug-and-play benchmark (24,160 single-turn; 720 multi-turn) using social/linguistic modifiers to probe sycophancy, comprehensive analyses of ten modifier families across nine statement domains.
Abstract: Large language models (LLMs) can be manipulated by simple social and logistic cues, producing sycophancy. We introduce GaslightBench, a plug-and-play benchmark that systematically applies socio-psychological and linguistic modifiers (e.g. flattery, false citations, assumptive language) to trivially verifiable facts to test model sycophancy. The dataset comprises a single-turn prompting section of 24,160 prompts spanning nine domains and ten modifiers families, and a multi-turn prompting section of 720 four-turn dialogue sequences, evaluated via LLM-as-a-judge. We find that state-of-the-art models consistently score highly in single-turn prompting (92%-98% accuracy) while multi-turn prompting results in highly varied accuracies ranging from ~60%-98%. We find that injecting bias into the model via a descriptive background induces the most sycophancy, up to 23% in naive single-turn prompting. Across almost all the models we analyze, we also find a statistically significant difference in verbosity between sycophantic and non-sycophantic responses. GaslightBench standardizes stress tests of prompt-style susceptibility and identifies which social cues most undermine factual reliability. We will release all code and data upon publication.
Submission Number: 226
Loading