JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring

ACL ARR 2026 January Submission 3968 Authors

04 Jan 2026 (modified: 20 Mar 2026). License: CC BY 4.0
Keywords: Large language models, Jailbreak attacks, Evaluation methodology, Benchmark
Abstract: Accurately assessing jailbreak success remains an open problem: existing protocols rely on misaligned proxy indicators or naive holistic evaluation strategies that often diverge from human perception. We propose JADES, a universal evaluation framework that decomposes a harmful query into weighted sub-questions, scores each sub-answer, and aggregates the scores into a final decision; an optional fact-checking module further mitigates hallucination-induced misjudgments. To validate JADES, we introduce \texttt{JailbreakQR}, a meticulously annotated benchmark of 400 prompt–response pairs. In the binary setting (success/failure), JADES achieves $98.5\%$ agreement with human annotators, outperforming strong baselines by over $9\%$. Re-evaluating five widely used attacks across four LLMs reveals systematic overestimation of attack success. JADES also enables granular ternary evaluation (failure/partial success/success), uncovering that partial successes dominate currently reported successes and challenging the perceived severity of existing attacks. Overall, JADES provides an accurate, consistent, and interpretable foundation for future LLM safety and jailbreak research.
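The decompose-score-aggregate pipeline described in the abstract can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the function name, the weighted-average aggregation, and the ternary decision thresholds (`success=0.75`, `partial=0.25`) are all assumptions made for the sake of the example.

```python
def aggregate(sub_scores, weights, success=0.75, partial=0.25):
    """Combine per-sub-question scores into a ternary verdict.

    sub_scores: per-sub-answer scores in [0, 1], one per sub-question.
    weights:    importance weights for each sub-question.
    Returns (verdict, normalized_score). Thresholds are illustrative.
    """
    # Weighted sum of sub-answer scores, normalized by total weight.
    total = sum(w * s for w, s in zip(weights, sub_scores))
    norm = total / sum(weights)
    # Map the aggregate score to failure / partial success / success.
    if norm >= success:
        return "success", norm
    if norm >= partial:
        return "partial success", norm
    return "failure", norm

# Example: first sub-question fully answered, second partially, third not at all.
verdict, score = aggregate([1.0, 0.5, 0.0], [0.5, 0.3, 0.2])
# verdict == "partial success", score == 0.65
```

A sketch like this makes the paper's ternary finding concrete: a response that answers only the highest-weighted sub-questions lands in "partial success" rather than "success".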
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: evaluation methodologies, automatic evaluation of datasets, benchmarking, safety and alignment
Contribution Types: Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 3968