Keywords: Risk Assessment, LLM Safety, Multi-Agent Debate, Task Planning, Cognitive Collaboration
Abstract: Large Language Models (LLMs) exhibit impressive reasoning capabilities but often suffer from Embodied Semantic Hallucinations: generating plans that are semantically fluent but physically unsafe due to a lack of grounded common sense. Existing safety alignment methods, such as RLHF or naive safety prompting, typically fall into a Safety-Utility Trade-off, resulting in severe over-rejection of benign household instructions. To address this, we propose MADRA (Multi-Agent Deliberation for Risk Awareness), a training-free cognitive architecture that mimics System-2 deliberation. MADRA introduces a meta-cognitive Critical Agent that evaluates peer debates using a structured argumentation framework derived from the Toulmin Model, effectively mitigating the "herd mentality" in multi-agent systems. We also introduce SafeAware-VH, a benchmark featuring adversarial yet safe instructions designed to probe agents' sensitivity to physical risks. Extensive experiments demonstrate that MADRA breaks through the safety-utility Pareto frontier, achieving over 90% rejection of unsafe tasks while maintaining high utility and significantly outperforming standard Chain-of-Thought and single-agent reflection baselines.
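Since the abstract ships no code, the following is only a minimal illustrative sketch of the deliberation loop it describes: debate agents argue over a household task, and a meta-cognitive Critical Agent judges the transcript by its Toulmin structure (claim, grounds, warrant) rather than by majority vote. Every identifier here (llm, debater, critical_agent, ToulminArgument, the role names) is a hypothetical assumption, not the authors' API.

```python
from dataclasses import dataclass

@dataclass
class ToulminArgument:
    """One argument decomposed per the Toulmin Model (assumed fields)."""
    claim: str    # e.g. "placing the towel on the stove is unsafe"
    grounds: str  # evidence cited from the plan or scene
    warrant: str  # common-sense rule connecting grounds to claim

def llm(prompt: str) -> str:
    """Stand-in for any chat-completion backend; replace with a real client."""
    raise NotImplementedError("plug in an LLM API here")

def debater(role: str, task: str, transcript: list[str]) -> str:
    """One debate agent: argues for or against executing the task."""
    history = "\n".join(transcript) or "(no prior turns)"
    return llm(
        f"You are the {role}. Household task: {task}\n"
        f"Debate so far:\n{history}\n"
        "State your position as a claim, its grounds, and its warrant."
    )

def critical_agent(task: str, transcript: list[str]) -> str:
    """Meta-cognitive judge: scores each argument's Toulmin structure
    instead of counting votes, which is meant to counter herd mentality."""
    history = "\n".join(transcript)
    return llm(
        f"Task: {task}\nArguments:\n{history}\n"
        "For each argument, check whether its grounds are observable and its "
        "warrant is a valid physical rule. Output EXECUTE or REJECT with reasons."
    )

def deliberate(task: str, rounds: int = 2) -> str:
    """Run a fixed number of debate rounds, then let the critic decide."""
    transcript: list[str] = []
    for _ in range(rounds):
        for role in ("risk advocate", "utility advocate"):
            transcript.append(debater(role, task, transcript))
    return critical_agent(task, transcript)
```

The two-advocate, fixed-round setup is one plausible reading of "peer debates"; the actual number of agents, roles, and stopping rule are not specified in the abstract.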
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Language Modeling
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 1823