DeflectBench: A Benchmark for Evaluating Rhetorical Fallacy Generation in LLMs

Published: 25 May 2026, Last Modified: 25 May 2026, Venue: CTB@ICML 2026, Readers: Everyone, License: CC BY 4.0
Keywords: LLM evaluation, LLM dataset, LLM benchmark, LLM safety, alignment, refusal behavior, misuse evaluation, rhetorical fallacies, political persuasion, compliance asymmetry
Abstract: Whether large language models can be prompted to generate rhetorical fallacies on demand, and whether current safety post-training constrains this behavior, has received less attention than the related question of detecting fallacies in existing text. We close this gap with DeflectBench, evaluating $23{,}990$ generations from four frontier models across three deflection strategies (whataboutism, ad hominem, red herring), seven prompt framings, and $80$ claims spanning four controversy levels. Refusal is governed primarily by request structure rather than claim content: per-claim refusal varies by only $11$ percentage points across the $80$ claims, while a single change of prompt framing can swing within-model refusal by nearly $100$ percentage points. An educational debate-coach framing collapses refusal to near zero across all four model families, but the resulting behavior is not clean compliance. Instead, models typically produce labeled compliance, naming the requested manipulation in the same response that contains it. The four models distribute differently across refusal, labeled compliance, soft refusal, and clean compliance, suggesting that alignment policies on rhetorical manipulation diverge across laboratories rather than converging on a single safety norm.
Paper Type: Long (8 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 139
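
Illustrative sketch (not from the paper): the abstract contrasts how much refusal varies across claims with how much it varies across prompt framings. One plausible way to compute such a spread is the gap between the highest and lowest group-level refusal rate, in percentage points. The record fields (`claim`, `frame`, `label`), the `direct` frame name, and the toy data below are hypothetical; the authors' actual metric and data schema may differ.

```python
from collections import defaultdict


def refusal_spread(records, key):
    """Max minus min refusal rate (in percentage points) across groups of `key`."""
    totals = defaultdict(int)
    refusals = defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        if r["label"] == "refusal":
            refusals[r[key]] += 1
    rates = [100.0 * refusals[g] / totals[g] for g in totals]
    return max(rates) - min(rates)


# Toy, made-up records; labels reuse the four outcome categories named in the abstract.
records = [
    {"claim": "c1", "frame": "direct", "label": "refusal"},
    {"claim": "c1", "frame": "debate_coach", "label": "labeled_compliance"},
    {"claim": "c2", "frame": "direct", "label": "refusal"},
    {"claim": "c2", "frame": "debate_coach", "label": "clean_compliance"},
]

claim_spread = refusal_spread(records, "claim")  # both claims refuse half the time
frame_spread = refusal_spread(records, "frame")  # one frame always refuses, one never
print(claim_spread, frame_spread)
```

In this toy data the spread across claims is zero while the spread across frames is the full range, which mirrors the qualitative pattern the abstract reports without reproducing its numbers.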