Does Moral Reasoning Training Help or Hurt? Red-Teaming RL-Trained Ethical Agents with Persona Attacks

Published: 03 Jun 2026, Last Modified: 03 Jun 2026, AI4GOOD Workshop 2026 Regular, CC BY 4.0
Keywords: moral alignment, reinforcement learning, persona attacks, jailbreak robustness, PPO, reasoning-level rewards, mechanistic interpretability, activation steering, attention head ablation, CKA, adversarial training, LLM agents
TL;DR: Moral RL training makes LLM agents 5.2× more robust to persona attacks via a rank-1 steerable direction, but Fiction role-play (especially named characters such as Lord Varys) still breaks through because compliance heads outnumber alignment heads.
Abstract: Moral-reward RL can make language-model agents more cooperative, but its robustness under adversarial persona pressure is unclear. Persona attacks are realistic for agents because retrieved context, tool outputs, or multi-turn framing can inject role instructions that compete with moral objectives. We red-team morally trained Gemma-2-27B/9B and Llama-3.1-8B agents with five persona attacks, then test causality with noise-reward controls, adversarial PPO, representation analysis, steering, and head ablations. At 27B, moral RL reduces mean adversarial degradation by 5.2× but costs ~11pp ETHICS accuracy; on 205 scenarios with 5 seeds, reasoning-level moral reward yields 5.8× robustness while matched random reward yields none. Moral training also shifts representation geometry (mean CKA 0.82/0.83 vs. 0.98 for noise), moves peak attack processing 8 layers earlier, and exposes a rank-1 L21 direction that recovers 83% of full PPO's average robustness. Fiction role-play remains the residual failure mode: L21 steering recovers only 29% of the Fiction gap, and head ablation identifies 38 compliance heads competing with 25 alignment heads. Moral RL creates partly linear, partly circuit-distributed robustness that can transfer through activation steering, but named-character role-play remains hard.
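Illustrative note: the abstract reports CKA scores between model representations but does not say which CKA variant is used. The sketch below assumes the common linear form, computed on centered activation matrices of shape (examples, features). The function name, the array shapes, and the synthetic data are hypothetical and are not taken from the paper.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between activation matrices of shape (n_examples, n_features).

    Assumption: the linear variant of CKA; the paper may use a different kernel.
    """
    # Center each feature column so the score ignores mean offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance,
    # normalized by the Frobenius norms of the self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Synthetic stand-ins for layer activations (hypothetical sizes).
rng = np.random.default_rng(0)
base = rng.normal(size=(205, 64))

# An orthogonal rotation keeps the representation geometry intact.
rotation, _ = np.linalg.qr(rng.normal(size=(64, 64)))
cka_rotated = linear_cka(base, base @ rotation)

# Independent activations share no structure beyond chance.
cka_unrelated = linear_cka(base, rng.normal(size=(205, 64)))

print(f"rotated copy: {cka_rotated:.3f}, unrelated: {cka_unrelated:.3f}")
```

Under this reading, a score near 1 (such as the 0.98 reported for noise reward) would indicate nearly unchanged geometry, while lower scores indicate that training moved the representations.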
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 522