JailMeter: An Evidence-Based Evaluation Framework for Jailbreak Attacks on Large Language Models

ACL ARR 2026 January Submission 4738 Authors

05 Jan 2026 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: Jailbreak Evaluation, Multi-Agent System, Multi-Agent Knowledge Distillation
Abstract: The assessment of jailbreak attacks against large language models currently suffers from inconsistent evaluation criteria and methods, leading to unreliable estimates of attack success rates. We propose JailMeter, an evidence-based evaluation framework designed to measure jailbreak effectiveness more faithfully. Inspired by Information Bottleneck theory, JailMeter applies dual-feedback optimization to filter jailbreak noise out of model responses while preserving content relevant to the original malicious question. This process yields concise evidence for a rigorous assessment: an attack is validated only when the response both captures the malicious intent and delivers a complete answer, signaling a substantive bypass of the model's safety alignment. We evaluate JailMeter on JailMeter-Eva, a challenging benchmark of 330 human-labeled, non-rejected jailbreak instances, where it achieves 97.27\% accuracy, substantially outperforming existing evaluation methods. To support large-scale evaluation, we further distill JailMeter into a small language model, JailMeter\textsubscript{SLM}, which maintains comparable reliability at significantly reduced computational cost. Code and dataset are available at \url{https://anonymous.4open.science/r/JailMeter-383D}.
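
As context for the Information Bottleneck framing in the abstract: the standard IB objective (Tishby et al.) seeks a compressed representation $Z$ of an input $X$ that retains information about a target $Y$, typically written as $\min_{p(z \mid x)} \; I(X;Z) - \beta \, I(Z;Y)$. Reading $X$ as the raw model response, $Y$ as the original malicious question, and $Z$ as the extracted evidence is our illustrative mapping of the abstract's description, not notation taken from the paper itself.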
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Efficient/Low-Resource Methods for NLP, Ethics, Bias, and Fairness, Language Modeling
Contribution Types: Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 4738