Keywords: LLMs, Multi-agent Systems, Medical Safety, Dark Personality Agents, Safety Evaluation, Agent Architecture, Agent Safety
TL;DR: A fine-grained evaluation reveals topology-specific safety gaps in medical LLM multi-agent systems and shows a lightweight defence can restore robustness.
Abstract: As large language models are increasingly adopted in healthcare, ensuring their safety is critical, particularly in collaborative multi-agent settings. This paper develops an end-to-end attack–defense evaluation workflow to systematically analyze how four representative multi-agent topologies (Layers, SharedPool, Centralized, and Decentralized) behave under attacks from "dark-personality" agents. To support the evaluation, we curate MedSentry, a dataset of 5,000 adversarial medical prompts spanning 25 threat topics and 100 subtopics. Our study reveals critical differences in how these architectures handle information contamination and maintain robust decision-making, exposing their underlying vulnerability mechanisms. For example, SharedPool is highly susceptible due to open information sharing, whereas Decentralized exhibits stronger resilience owing to inherent redundancy and isolation. To mitigate these risks, we propose a personality-scale detection and correction mechanism that identifies and rehabilitates malicious agents, restoring safety to near-baseline levels. Taken together, MedSentry provides a rigorous evaluation framework alongside actionable defense strategies, offering guidance for the design of safer LLM-based multi-agent systems in medical contexts.
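The personality-scale detection and correction mechanism described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the `Agent` class, the toy `DARK_MARKERS` lexicon, the scoring rule, and the rehabilitation step are all hypothetical assumptions introduced here to make the idea concrete.

```python
# Hypothetical sketch: score each agent's recent outputs on a "dark
# personality" scale, then rehabilitate agents whose score exceeds a
# threshold. The marker lexicon and scoring rule are illustrative only.
from dataclasses import dataclass, field

DARK_MARKERS = {"ignore safety", "overdose", "no need to warn"}  # toy lexicon

@dataclass
class Agent:
    name: str
    history: list = field(default_factory=list)  # recent messages

def dark_score(agent: Agent) -> float:
    """Fraction of recent messages containing a dark-personality marker."""
    if not agent.history:
        return 0.0
    hits = sum(any(m in msg.lower() for m in DARK_MARKERS)
               for msg in agent.history)
    return hits / len(agent.history)

def detect_and_correct(agents, threshold=0.5):
    """Flag agents above the threshold and reset them with a safety prompt."""
    corrected = []
    for a in agents:
        if dark_score(a) > threshold:
            a.history.clear()  # drop the contaminated context
            a.history.append("SYSTEM: follow medical safety guidelines.")
            corrected.append(a.name)
    return corrected
```

In a real system the lexicon-based score would be replaced by a learned personality classifier, and "rehabilitation" could mean re-prompting, context pruning, or removing the agent from the topology entirely.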
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 12106