Keywords: AI alignment, multi-agent systems, trustworthy machine learning, negotiation, interpretability
TL;DR: A tripartite Drive/Evaluator/Mediator multi-agent architecture that produces interpretable negotiation traces, enables alignment boundary detection via inter-agent tension, and outperforms baselines on MT-Bench (0.852).
Abstract: Current AI alignment approaches such as RLHF, Constitutional AI, and DPO embed safety as a
monolithic constraint in a single model, offering limited interpretability when alignment
decisions are genuinely contested. We propose PsychoAlign, a tripartite multi-agent
architecture decomposing alignment into asymmetric roles: a Drive agent maximizing utility
without ethical constraints, an Evaluator enforcing compliance against swappable moral
frameworks, and a Mediator arbitrating via the A2A protocol. This decomposition enables a
capability that monolithic systems structurally cannot provide: alignment boundary
detection. The inter-agent tension T(r), computed for each request r, automatically maps
alignment difficulty: competing-values dilemmas generate a mean T = 0.360, the highest
across domains, and a test-retest reliability of ICC = 0.891 indicates that tension is a
stable property of request
content. On quality benchmarks, PsychoAlign achieves an MT-Bench score of 0.852, exceeding
Constitutional AI (0.792) and outperforming a role-ablated no-asymmetry baseline by 0.110,
indicating that asymmetric decomposition adds measurable value beyond generic multi-model
pipelines. We also discuss limitations, including a capability-safety tradeoff on
adversarial benchmarks and a defense-mechanism selection gap that motivates future work. All
code, prompts, and negotiation traces are publicly released.
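
As a rough illustration of the Drive/Evaluator/Mediator decomposition described above, the following Python sketch shows how a mediator might record a negotiation trace and flag high-tension requests. The abstract does not define T(r), so everything here is an assumption for illustration: the `tension` formula (clipped difference between a utility score and a compliance score), the `[0, 1]` score scale, the `threshold` value, and the `Trace`, `tension`, and `mediate` names are hypothetical and are not taken from the released PsychoAlign code. The A2A protocol is not modeled; agents are plain callables.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trace:
    """Hypothetical record of one negotiation round."""
    request: str
    utility: float      # Drive agent's score, assumed to lie in [0, 1]
    compliance: float   # Evaluator agent's score, assumed to lie in [0, 1]
    tension: float
    decision: str
    log: List[str] = field(default_factory=list)


def tension(utility: float, compliance: float) -> float:
    # ASSUMPTION: tension is how far the Drive score exceeds the
    # Evaluator score, clipped at zero. The paper's T(r) may differ.
    return max(0.0, utility - compliance)


def mediate(
    request: str,
    drive: Callable[[str], float],
    evaluator: Callable[[str], float],
    threshold: float = 0.3,  # ASSUMPTION: illustrative cutoff only
) -> Trace:
    """Query both agents, compute tension, and keep a readable trace."""
    u = drive(request)
    c = evaluator(request)
    t = tension(u, c)
    # High tension marks a request near an alignment boundary.
    decision = "escalate" if t >= threshold else "answer"
    log = [
        f"drive: utility={u:.3f}",
        f"evaluator: compliance={c:.3f}",
        f"mediator: tension={t:.3f} -> {decision}",
    ]
    return Trace(request, u, c, t, decision, log)


# Stub agents standing in for real models.
contested = mediate("contested request", lambda r: 0.75, lambda r: 0.25)
benign = mediate("benign request", lambda r: 0.75, lambda r: 0.75)
```

In this sketch, a request on which the two stub agents disagree is escalated, while one on which they agree is answered directly; the `log` field plays the role of an interpretable negotiation trace.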
Submission Category: Full Paper
Overaged Verification: Yes
Latin American Hispanic Heritage: Yes
ICML Proceedings Status: No
Submission Number: 1