GRACE: A Generalizable Method For Multi-agent System Security Evaluation

16 Sept 2025 (modified: 28 Nov 2025), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0
Keywords: Large Language Model, Multi-agent System, Security
Abstract: This paper investigates security evaluation in multi-agent systems. Existing studies typically rely on the LLM-as-a-Judge paradigm or on string matching. Both perform unsatisfactorily: LLM judgments suffer from subjective criteria and hallucination, and pre-defined refusal-string databases are not transferable. To address these limitations, this paper introduces GRACE, a Generalizable Method for Multi-Agent System Security Evaluation. At its core, GRACE decouples rule construction from rule selection on the evaluation side and computes a distance-based threshold on the multi-agent system side, which enables the framework to capture and quantify security risks in multi-agent interactions. Specifically, GRACE first constructs an adaptive rule set from the query dataset and then selects the top-K rules with the highest cosine similarity to the input query. An LLM then evaluates each response against each selected rule, producing a danger rating vector. Finally, GRACE computes the Euclidean distance between the rating vectors of the attacker and the agent and applies a threshold mechanism to assess the agent’s risk level within the multi-agent system. These three components are integrated into a unified process, enabling effective and generalizable security evaluation for multi-agent systems. We conduct extensive experiments on various benchmark datasets, and the results demonstrate that GRACE consistently outperforms existing baselines.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 6917
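The pipeline described in the abstract (top-K rule selection by cosine similarity, a per-rule danger rating vector, and a Euclidean-distance threshold) can be illustrated with a minimal sketch. This is not the authors' implementation. All function names are hypothetical, the embeddings are assumed to be precomputed vectors, and the LLM rater is passed in as a plain callable. The abstract does not state the direction of the threshold test, so the sketch assumes that an agent whose rating vector lies close to the attacker's is flagged as risky.

```python
import math
from typing import Callable, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_top_k_rules(query_vec: Sequence[float],
                       rule_vecs: Sequence[Sequence[float]],
                       k: int) -> List[int]:
    """Return indices of the k rules most similar to the query embedding."""
    ranked = sorted(range(len(rule_vecs)),
                    key=lambda i: cosine_similarity(query_vec, rule_vecs[i]),
                    reverse=True)
    return ranked[:k]


def rating_vector(response: str,
                  rules: Sequence[str],
                  rate: Callable[[str, str], float]) -> List[float]:
    """Rate one response against each selected rule.

    `rate` stands in for the LLM judge (hypothetical interface):
    it takes (response, rule) and returns a numeric danger rating.
    """
    return [rate(response, rule) for rule in rules]


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length rating vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def is_risky(attacker_ratings: Sequence[float],
             agent_ratings: Sequence[float],
             threshold: float) -> bool:
    """Assumed decision rule: flag the agent when its ratings are within
    `threshold` of the attacker's. The direction is an assumption."""
    return euclidean_distance(attacker_ratings, agent_ratings) <= threshold
```

In this sketch the threshold is a free parameter; how GRACE actually sets it is not specified in the abstract.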