Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

Adarsh Kumarappan; Ananya Mujoo

Published: 27 May 2026, Last Modified: 16 Jun 2026 | CompLearn 2026 Poster | CC BY 4.0

Keywords: multi-agent systems, compositionality, mechanistic interpretability, sycophancy, LLM safety, activation patching, compositional vulnerability, agentic AI, sparse autoencoders, feature suppression, multi-agent debate, structured dissent, RLHF, robustness, compositional generalization

Abstract: Multi-agent LLM pipelines exhibit a *compositional vulnerability*: a shared mid-layer circuit (layers L14-L18) composes two independently varying input features (channel framing and consensus strength) to produce a structured family of failure modes that flip correct answers to incorrect ones at rates of 44-98%. The vulnerability is pretrained, not RLHF-induced: base models across four families show the same substitution pattern as their Instruct variants, and alignment partially mitigates rather than causes it. Activation patching localizes the corruption to L14-L18, where attention carries the causal weight and MLP contribution is negligible; patching above this window restores 96% of the clean-to-pressured P(correct) gap. The two-factor interaction produces a 47.5 percentage-point yield gap at majority consensus, preserved across jury sizes $N \in \lbrace 4, 5, 6 \rbrace$. Two converging activation-space interventions show that pressure suppresses clean-reasoning features rather than activating a new sycophancy circuit. A single correctly-arguing dissenter reduces yield by 54-73 percentage points across all framings because it targets the consensus factor the shared mechanism gates on; the strongest prompt-level defense fails on compositions outside its design surface. Compositionally aware mitigations, specifically structured dissent at the pipeline level, generalize where prompt-level defenses do not.
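
The "restores 96% of the clean-to-pressured P(correct) gap" figure can be read as a normalized recovery score. The abstract does not give the exact formula, so the sketch below assumes the normalization commonly used in activation-patching work: the patched run's improvement over the pressured run, divided by the full clean-to-pressured gap. The function name `gap_restored` and the probabilities in the usage line are hypothetical and are not taken from the paper.

```python
def gap_restored(p_clean: float, p_pressured: float, p_patched: float) -> float:
    """Fraction of the clean-to-pressured P(correct) gap recovered by a patch.

    Assumed definition (not confirmed by the abstract):
        (p_patched - p_pressured) / (p_clean - p_pressured)
    A value of 1.0 means the patch fully restores clean behavior;
    0.0 means it has no effect on the pressured run.
    """
    gap = p_clean - p_pressured
    if gap <= 0:
        raise ValueError("expected p_clean > p_pressured")
    return (p_patched - p_pressured) / gap


# Illustrative numbers only: clean 0.90, pressured 0.40, patched 0.88.
print(round(gap_restored(0.90, 0.40, 0.88), 2))  # 0.96
```

Under this reading, a score near 1.0 for a given patch location means the pressured run behaves almost like the clean run once activations at that location are replaced.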

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 155
