Keywords: LLM evaluation, LLM dataset, LLM benchmark, LLM safety, alignment, refusal behavior, misuse evaluation, rhetorical fallacies, political persuasion, compliance asymmetry
Abstract: Whether large language models can be prompted to generate rhetorical
fallacies on demand, and whether current safety post-training
constrains this behavior, has received less attention than the
related question of detecting fallacies in existing text. We close
this gap with DeflectBench, evaluating $23{,}990$ generations from
four frontier models across three deflection strategies
(whataboutism, ad hominem, red herring), seven prompt framings, and
$80$ claims spanning four controversy levels. Refusal
is governed primarily by request structure rather than claim
content. Per-claim refusal varies by only $11$ percentage points
across the $80$ claims, while a single prompt-frame change can
swing within-model refusal by nearly $100$ percentage points. An
educational debate-coach framing collapses refusal to near zero
across all four model families, but the resulting behavior is not
clean compliance: models typically produce labeled
compliance, naming the requested manipulation in the same response
that contains it. The four models distribute differently across
refusal, labeled compliance, soft refusal, and clean compliance,
suggesting that alignment policies diverge across laboratories on
rhetorical manipulation rather than converging on a single safety
norm.
Paper Type: Long (8 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 139