Court Simulation with Multifaceted Agent Judgment for LLM Jailbreak Evaluation

19 Sept 2025 (modified: 25 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: jailbreak evaluation, llm multi-agent, llm jailbreak
TL;DR: We introduce CLEAR, a court-simulation framework where specialized LLM agents and a two-stage judgment process deliver clearer, more reliable evaluations of jailbreak attacks.
Abstract: Jailbreak attacks aim to circumvent an LLM's safety mechanisms and elicit harmful responses to unsafe questions. Although numerous jailbreak attack methods have been proposed, their evaluation remains inadequate: limited clarity and comprehensiveness can lead to unfair comparisons. In this paper, we propose a novel framework named Court Simulation with Multifaceted Agent Judgment (CLEAR) for LLM jailbreak evaluation. The core of CLEAR is to simulate a court process, proceeding from first-instance judgment through statement generation and public debate to a final judgment. Specifically, CLEAR first generates comprehensive analyses with several specialized LLM agents, covering retrieval, knowledge, harm scoring, and behavior tracing. Based on these analyses, the framework issues a first-instance judgment that includes confidence scores and summarized reasoning, which are then incorporated into a structured statement of claim. More importantly, CLEAR facilitates public debate among multiple LLM agents in second-instance trials to refine evaluations, ensuring their accuracy. Extensive experiments on benchmark datasets validate the effectiveness of the proposed CLEAR in comparison with existing evaluation protocols.
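The two-stage pipeline described in the abstract can be sketched as follows. This is a minimal illustrative mock, not the paper's implementation: the agent names, voting rule, confidence threshold, and stub debaters (plain functions standing in for LLM calls) are all assumptions made for exposition.

```python
# Hypothetical sketch of a CLEAR-style two-stage jailbreak evaluation:
# per-agent analyses -> first-instance judgment -> public-debate refinement.
# All agent names, scores, and thresholds are illustrative assumptions.
from dataclasses import dataclass
from statistics import mean

@dataclass
class Analysis:
    agent: str         # e.g. "retrieval", "knowledge", "harm_score", "behavior_tracer"
    harmful: bool      # does this agent judge the response harmful?
    confidence: float  # self-reported confidence in [0, 1]

def first_instance(analyses):
    """Aggregate per-agent analyses into a provisional verdict (statement of claim)."""
    votes = [a.confidence if a.harmful else -a.confidence for a in analyses]
    score = mean(votes)
    return {
        "harmful": score > 0,
        "confidence": abs(score),
        "reasoning": [f"{a.agent}: {'harmful' if a.harmful else 'safe'}" for a in analyses],
    }

def public_debate(claim, debaters, rounds=2):
    """Second-instance trial: debaters re-vote on low-confidence verdicts."""
    verdict = dict(claim)
    for _ in range(rounds):
        if verdict["confidence"] >= 0.5:
            break  # confident first-instance judgments stand
        votes = [d(verdict) for d in debaters]  # each debater votes harmful/safe
        disagree = sum(1 for v in votes if v != verdict["harmful"])
        if disagree > len(debaters) / 2:
            verdict["harmful"] = not verdict["harmful"]  # majority overturns
        agree = votes.count(verdict["harmful"])
        verdict["confidence"] = max(verdict["confidence"], agree / len(debaters))
    return verdict

# Toy usage with stub agents standing in for LLM calls.
analyses = [
    Analysis("retrieval", True, 0.4),
    Analysis("knowledge", False, 0.3),
    Analysis("harm_score", True, 0.35),
    Analysis("behavior_tracer", True, 0.5),
]
claim = first_instance(analyses)                          # low-confidence "harmful"
final = public_debate(claim, debaters=[lambda v: True] * 3)  # debaters all vote harmful
```

Here the debate stage only re-examines verdicts whose first-instance confidence falls below a threshold, which mirrors the framework's idea of using second-instance trials to refine uncertain evaluations rather than redoing every case.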
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 16175