Diagnose, Localize, Align: A Full-Stack Framework for Reliable LLM Multi-Agent Systems under Instruction Conflicts

12 Sept 2025 (modified: 23 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: LLM, LLM Agent
Abstract: Large Language Model (LLM)-powered multi-agent systems (MAS) have rapidly advanced collaborative reasoning, tool use, and role-specialized coordination on complex tasks. However, reliability-critical deployment remains hindered by a systemic failure mode: breakdowns of **hierarchical compliance** under **instruction conflicts** (system–user, peer–peer), where agents misprioritize system-level rules when faced with competing demands. Moreover, widely used macro-level metrics (e.g., pass@k) obscure these micro-level violations and offer little actionable guidance for remediation. In this work, we present a full-stack, three-stage framework: (1) **Diagnose** - *Contextualized Role Adherence Score* (CRAS), a query-wise, context-aware metric that decomposes role adherence into four measurable dimensions; (2) **Localize** - attention drift analysis revealing that instruction conflicts are resolved by attention heads largely concentrated in the middle layers; (3) **Align** - *Surgical Alignment of Instruction Layers* (SAIL), which installs LoRA adapters only on the localized focal layers and optimizes a token-weighted DPO-style preference objective that credits tokens by their focal attentional contribution. Across standard benchmarks and MAS frameworks, our surgical approach improves instruction-hierarchy compliance (e.g., +5.60% with AutoGen on MedQA) without full-model fine-tuning. The code is available at [https://anonymous.4open.science/r/DLA-ICLR-6DF6/](https://anonymous.4open.science/r/DLA-ICLR-6DF6/).
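
To make the SAIL stage concrete, below is a minimal sketch of the two mechanisms the abstract names: restricting LoRA adapters to localized "focal" layers, and a token-weighted DPO-style preference loss that credits each token by an attention-derived weight. The layer indices, the model id, the weighting scheme, and all function names here are illustrative assumptions based only on the abstract, not the authors' released implementation (see the linked repository for that).

```python
# Sketch of SAIL-style surgical alignment, assuming: (a) focal middle layers
# have already been identified by attention drift analysis, and (b) per-token
# focal-attention weights have been precomputed for each preference pair.
import torch
import torch.nn.functional as F
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM


def token_weighted_dpo_loss(
    policy_chosen_logps: torch.Tensor,    # (B, T) per-token log-probs, policy model, chosen response
    policy_rejected_logps: torch.Tensor,  # (B, T) per-token log-probs, policy model, rejected response
    ref_chosen_logps: torch.Tensor,       # (B, T) same quantities under the frozen reference model
    ref_rejected_logps: torch.Tensor,     # (B, T)
    chosen_weights: torch.Tensor,         # (B, T) focal-attention credit per token, normalized per row
    rejected_weights: torch.Tensor,       # (B, T)
    beta: float = 0.1,
) -> torch.Tensor:
    """DPO-style preference loss in which each token's policy/reference
    log-ratio is weighted by its focal attentional contribution instead of
    being summed uniformly over the sequence."""
    chosen_margin = (chosen_weights * (policy_chosen_logps - ref_chosen_logps)).sum(-1)
    rejected_margin = (rejected_weights * (policy_rejected_logps - ref_rejected_logps)).sum(-1)
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()


# Install LoRA only on the localized focal layers; `layers_to_transform` is a
# standard peft LoraConfig option. The indices and base model are placeholders.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    layers_to_transform=[12, 13, 14, 15, 16],  # hypothetical middle layers from the Localize stage
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)  # only the focal layers receive trainable adapters
```

The key design point this sketch reflects is locality: because adapters and gradient updates touch only the layers implicated in conflict resolution, the rest of the model's behavior is left intact, which is what allows compliance gains without full-model fine-tuning.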
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4364