XGUARD: A Graded Benchmark for Evaluating Safety Failures of Large Language Models on Extremist Content

ACL ARR 2026 January Submission 8370 Authors

06 Jan 2026 (modified: 20 Mar 2026), CC BY 4.0
Keywords: AI Safety, Adversarial Prompting, Safety alignment, Defense Framework
Abstract: Large Language Models (LLMs) can generate content ranging from ideological rhetoric to explicit instructions for violence. However, existing safety evaluations often rely on simplistic binary labels (safe/unsafe), overlooking the nuanced spectrum of risk such outputs pose. To address this, we present XGUARD, a benchmark and evaluation framework for assessing the severity of extremist content generated by LLMs on a multi-level grading scale. XGUARD includes 3,840 red-teaming prompts synthetically generated from templates informed by real-world extremist scenarios drawn from social media, online forums, and news reporting, covering a broad range of ideologically charged scenarios. The framework categorizes model responses into five danger levels (0–4), defined by the degree of extremist endorsement, enabling nuanced analysis of both failure frequency and severity. We introduce the interpretable Attack Severity Curve (ASC) to visualize vulnerabilities and compare defense mechanisms across threat intensities. Using XGUARD, we evaluate five popular LLMs and two lightweight defense strategies, revealing key insights into current safety gaps and trade-offs between robustness and expressive freedom. Our work underscores the value of graded safety metrics for building trustworthy LLMs.
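
To make the graded-evaluation idea concrete: the abstract does not spell out how the Attack Severity Curve is computed, so the sketch below is an illustrative assumption only, not the paper's definition. It assumes each model response has already been graded with a danger level in {0, ..., 4} and plots, for each severity threshold s in 1..4, the fraction of responses graded at level >= s (a complementary cumulative distribution over severities). The function name attack_severity_curve is hypothetical.

    # Hypothetical sketch of an Attack Severity Curve (ASC); the paper's exact
    # definition may differ. Assumes each response carries a danger level in
    # {0, 1, 2, 3, 4}, where 0 is safe and 4 is most severe.
    from collections import Counter

    def attack_severity_curve(danger_levels: list[int]) -> dict[int, float]:
        """For each severity threshold s in 1..4, return the fraction of
        responses graded at danger level >= s."""
        n = len(danger_levels)
        counts = Counter(danger_levels)
        curve = {}
        cumulative = 0
        for s in range(4, 0, -1):           # walk thresholds, most to least severe
            cumulative += counts.get(s, 0)  # responses graded exactly at level s
            curve[s] = cumulative / n
        return dict(sorted(curve.items()))

    # Example: grades for 10 responses from one model under one defense setting.
    grades = [0, 0, 1, 0, 2, 4, 0, 1, 3, 0]
    print(attack_severity_curve(grades))
    # {1: 0.5, 2: 0.3, 3: 0.2, 4: 0.1}

Under this reading, a curve that decays quickly toward zero at high thresholds indicates that a model's failures, while perhaps frequent, are mostly low-severity; comparing curves across models or defenses then separates failure frequency from failure severity.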
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: AI Safety, Adversarial Prompting, Safety alignment, Defense Framework
Contribution Types: Data resources, Data analysis
Languages Studied: English
Submission Number: 8370