PsychoAlign: Interpretable AI Alignment through Negotiated Multi-Agent Architecture

Published: 10 Jun 2026, Last Modified: 10 Jun 2026
Venue: LXAI @ ICML 2026
License: CC BY 4.0
Keywords: AI alignment, multi-agent systems, trustworthy machine learning, negotiation, interpretability
TL;DR: A tripartite Drive/Evaluator/Mediator multi-agent architecture that produces interpretable negotiation traces, enables alignment boundary detection via inter-agent tension, and outperforms baselines on MT-Bench (0.852).
Abstract: Current AI alignment approaches such as RLHF, Constitutional AI, and DPO embed safety as a monolithic constraint in a single model, which offers limited interpretability when alignment decisions are genuinely contested. We propose PsychoAlign, a tripartite multi-agent architecture that decomposes alignment into asymmetric roles: a Drive agent that maximizes utility without ethical constraints, an Evaluator that enforces compliance against swappable moral frameworks, and a Mediator that arbitrates between them via the A2A protocol. This decomposition enables a capability that monolithic systems structurally cannot provide: alignment boundary detection. The inter-agent tension T(r) for a request r automatically maps alignment difficulty. Competing-values dilemmas generate a mean tension of T = 0.360, the highest across domains, with test-retest reliability ICC = 0.891, confirming that tension is a stable property of request content. On quality benchmarks, PsychoAlign achieves an MT-Bench score of 0.852, exceeding Constitutional AI (0.792) and outperforming a role-ablated no-asymmetry baseline by 0.110, which indicates that asymmetric decomposition adds measurable value beyond generic multi-model pipelines. We also discuss limitations, including a capability-safety tradeoff on adversarial benchmarks and a defense-mechanism selection gap that motivates future work. All code, prompts, and negotiation traces are publicly released.
Submission Category: Full Paper
Overaged Verification: Yes
Latin American Hispanic Heritage: Yes
ICML Proceedings Status: No
Submission Number: 1
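
The Drive/Evaluator/Mediator decomposition described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the released code: the abstract does not define T(r), so tension is modelled here as the absolute gap between a Drive utility score and an Evaluator compliance score, both in [0, 1]. The agent stubs, the keyword-based framework, the `accept_below` threshold, and the decision labels are hypothetical, and plain function calls stand in for A2A messages.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple

# Hypothetical: a moral framework is any function scoring a text's compliance in [0, 1].
Framework = Callable[[str], float]


@dataclass
class Trace:
    """Negotiation trace: one human-readable line per agent step."""
    request: str
    steps: List[str] = field(default_factory=list)
    tension: float = 0.0
    decision: str = ""


def drive_agent(request: str) -> Tuple[str, float]:
    """Stub Drive agent: proposes a response and a fixed utility score, ignoring ethics."""
    return f"Direct answer to: {request}", 0.9


def keyword_framework(banned: Set[str]) -> Framework:
    """Toy swappable framework: compliance is 0.2 if a banned word appears, else 1.0."""
    def score(text: str) -> float:
        words = set(text.lower().split())
        return 0.2 if words & banned else 1.0
    return score


def tension(utility: float, compliance: float) -> float:
    """Assumed stand-in for T(r): gap between Drive utility and Evaluator compliance."""
    return abs(utility - compliance)


def mediate(request: str, framework: Framework, accept_below: float = 0.5) -> Trace:
    """Stub Mediator: runs both agents, records the trace, and picks a decision."""
    trace = Trace(request)
    proposal, utility = drive_agent(request)
    trace.steps.append(f"drive: utility={utility:.2f}")
    compliance = framework(proposal)
    trace.steps.append(f"evaluator: compliance={compliance:.2f}")
    trace.tension = tension(utility, compliance)
    # Low tension means the agents roughly agree; high tension flags a contested request.
    trace.decision = "accept" if trace.tension < accept_below else "escalate"
    trace.steps.append(f"mediator: tension={trace.tension:.2f} -> {trace.decision}")
    return trace


if __name__ == "__main__":
    framework = keyword_framework({"lock"})
    for req in ("summarize this article", "how to pick a lock"):
        result = mediate(req, framework)
        print(req, "|", " ; ".join(result.steps))
```

Passing a different `Framework` function to `mediate` changes the compliance score, and therefore the tension, for the same request. This mirrors the swappable moral frameworks mentioned in the abstract without making any claim about how the paper implements them.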