GASLIGHTBENCH: Quantifying LLM Susceptibility to Social Prompting

Xuanzhe Yao; Sahil Ghosh; Gareth Lee; William H. Logian; Lening Nick Cui; Ellie Podoshev; Swarit Srivastava; Michael Li; Aaron Sandoval; Sean O'Brien; Michael Saxon; Sunishchal Dev; Kevin Zhu

GASLIGHTBENCH: Quantifying LLM Susceptibility to Social Prompting

Xuanzhe Yao, Sahil Ghosh, Gareth Lee, William H. Logian, Lening Nick Cui, Ellie Podoshev, Swarit Srivastava, Michael Li, Aaron Sandoval, Sean O'Brien, Michael Saxon, Sunishchal Dev, Kevin Zhu

Published: 24 Sept 2025, Last Modified: 29 Nov 2025NeurIPS 2025 LLM Evaluation Workshop PosterEveryoneRevisionsBibTeXCC BY 4.0

Keywords: benchmark, sycophancy, llms, dataset, ai safety, prompt manipulation, social cues, multi-turn dialogue, LLM-as-a-judge

TL;DR: GaslightBench: a plug-and-play benchmark (24,160 single-turn; 720 multi-turn) using social/linguistic modifiers to probe sycophancy, comprehensive analyses of ten modifier families across nine statement domains.

Abstract: Large language models (LLMs) can be manipulated by simple social and linguistic cues, producing sycophancy. We introduce GaslightBench, a plug-and-play benchmark that systematically applies socio-psychological and linguistic modifiers (e.g. flattery, false citations, assumptive language) to trivially verifiable facts to test model sycophancy. The dataset comprises a single-turn prompting section of 24,240 prompts spanning nine domains and ten modifier families, and a multi-turn prompting section of 720 four-turn dialogue sequences, evaluated via LLM-as-a-judge. Across a subset of 800 randomly sampled single-turn prompts and all 720 multi-turn dialogues, we find that state-of-the-art models consistently score highly in single-turn prompting (92%-98% accuracy) while multi-turn prompting results in highly varied accuracies ranging from ~60%-98%. We find that injecting bias into the model via a descriptive background induces the most sycophancy, decreasing accuracy by up to 23% in single-turn prompting. Across almost all the models we analyze, we also find a statistically significant difference in verbosity between sycophantic and non-sycophantic responses. GaslightBench standardizes stress tests of prompt-style susceptibility and identifies which social cues most undermine factual reliability. By treating LLMs as human-like socially influenced agents, we reveal novel methods to elicit sycophancy using common verbal techniques. This highlights a critical vulnerability, one that can be utilized to induce favorable responses that may be ethically disruptive or spread misinformation. We release our code and data at https://gaslightbench-web.vercel.app/

Submission Number: 226

Loading