Playing Devil’s Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy
Keywords: Sycophancy, Persona Vectors, Activation Steering, Contrastive Activation Addition, Large Language Models, AI Alignment, Representation Engineering
TL;DR: Persona steering vectors (doubt, scrutiny) — never trained on sycophancy data — match 68–98% of CAA's effect, preserve accuracy when users are correct, and are geometrically independent of CAA. Sycophancy is a persona-level property.
Abstract: We study the effect of different personas on \textbf{sycophancy}: a model's tendency to agree with users even when the user is incorrect. The standard mitigation, Contrastive Activation Addition (CAA), derives a steering direction from labelled pairs of sycophantic and honest responses. This study evaluates whether off-the-shelf persona steering vectors, originally developed for general role-playing and never trained on sycophancy data, can serve as an alternative. In two instruction-tuned models, steering toward personas characterised by doubt or scrutiny achieves approximately 68% to 98% of CAA's reduction in sycophancy and, unlike CAA, maintains accuracy when the user is correct. The effect is also asymmetric: steering toward agreeable personas does not produce a mirror-image increase in sycophancy. Geometrically, the persona vectors are largely independent of the CAA sycophancy direction in activation space. Collectively, these findings suggest that sycophancy is better understood as a persona-level property than as a single steerable direction. We release our code here: \url{https://anonymous.4open.science/r/Sycophancy-Steering-9DF0/}.
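The two quantities the abstract relies on, a contrastive steering direction and its geometric overlap with a persona vector, can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' implementation: the function names, the hidden size `d`, the mean-difference construction, and the random stand-in for a persona vector are all assumptions made for the example.

```python
import numpy as np

def caa_direction(target_acts, contrast_acts):
    """Mean-difference direction between two sets of activations (assumed CAA form)."""
    return np.mean(target_acts, axis=0) - np.mean(contrast_acts, axis=0)

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; values near 0 indicate near-orthogonality."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def steer(hidden, direction, alpha):
    """Add a unit-norm steering direction, scaled by alpha, to each hidden state."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

# Synthetic stand-ins for layer activations (hypothetical data, not model outputs).
rng = np.random.default_rng(0)
d = 64                                   # assumed hidden size
honest = rng.normal(size=(32, d))        # activations on honest responses
sycophantic = rng.normal(size=(32, d))   # activations on sycophantic responses
sycophantic[:, 0] += 2.0                 # inject an artificial class difference

caa = caa_direction(honest, sycophantic) # points from sycophantic toward honest
persona = rng.normal(size=d)             # placeholder for a persona vector

overlap = cosine_similarity(caa, persona)
steered = steer(honest, caa, alpha=4.0)
```

In a real model the activations would be read from a chosen layer's residual stream and `steer` would be applied during the forward pass; a cosine similarity close to zero between `caa` and `persona` is the kind of measurement behind the geometric-independence claim.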
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 166