Pluralistic On-Policy Self-Distillation

Published: 02 Jun 2026, Last Modified: 11 Jun 2026, Venue: Pluralistic-Alignment 2026, Readers: Everyone, License: CC BY 4.0
Keywords: Alignment, Persona, On-Policy Self-Distillation
Abstract: Language feedback often contains multiple valid persona-dependent directions for improvement: a critique may ask a response to match the style of a professional advisor, a travel guide, or an artistic critic. This creates a challenge for pluralistic alignment, where distinct persona-specific feedback signals should be preserved rather than collapsed into a single reward or generic target. We propose Multi-Action-Head On-Policy Self-Distillation (MAH-OPSD), which combines persona-specific feedback with dense token-level on-policy distillation. For each prompt, MAH-OPSD first generates persona-specific rubrics to elicit more targeted critiques than generic feedback criteria. It then trains multiple persona action heads on a shared backbone: each head generates a response from the same prompt, receives its own rubric-guided critique, and distills from a critique-conditioned base model as its teacher. A lightweight router mixes the learned action heads based on the prompt, enabling adaptive response generation at inference time. We validate MAH-OPSD in two persona-faceted settings: a five-persona alignment task with rubric-guided critiques, and a multi-turn tutoring task with two teacher personas, where the per-turn feedback is the student's own reaction rather than an external critique. In both, the action heads specialize by persona and a learned router exposes them as a single adaptive policy, preserving distinct feedback pathways rather than merging all feedback into one generic policy.
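As a rough illustration of the architecture the abstract describes, the following minimal NumPy sketch shows several persona action heads reading the same backbone hidden states, a router that mixes them from a prompt embedding, and a per-token distillation loss against a teacher distribution. Everything concrete here is an assumption made for illustration and is not taken from the paper: the heads are plain linear maps, the router mixes in probability space, the loss is KL(student || teacher), the teacher is random noise standing in for the critique-conditioned base model, and all names (`head_probs`, `route`, `mix_heads`, `token_kl`) and sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def head_probs(hidden, head_weights):
    # hidden: (T, d) shared-backbone states; head_weights: (K, d, V).
    # Returns per-head next-token distributions of shape (K, T, V).
    logits = np.einsum("td,kdv->ktv", hidden, head_weights)
    return softmax(logits, axis=-1)

def route(prompt_embedding, router_weights):
    # prompt_embedding: (d,); router_weights: (d, K).
    # Returns mixing weights over the K heads (non-negative, sum to 1).
    return softmax(prompt_embedding @ router_weights)

def mix_heads(probs, gate):
    # Assumption: heads are mixed in probability space, not logit space.
    return np.einsum("k,ktv->tv", gate, probs)

def token_kl(student, teacher, eps=1e-12):
    # Assumption: KL(student || teacher) per token, averaged over positions.
    kl = np.sum(student * (np.log(student + eps) - np.log(teacher + eps)), axis=-1)
    return float(kl.mean())

rng = np.random.default_rng(0)
T, d, V, K = 4, 8, 16, 5  # tokens, hidden size, vocab size, personas (toy values)

hidden = rng.normal(size=(T, d))
head_weights = rng.normal(size=(K, d, V))
router_weights = rng.normal(size=(d, K))

probs = head_probs(hidden, head_weights)
# Assumption: the prompt embedding is the mean of the hidden states.
gate = route(hidden.mean(axis=0), router_weights)
mixed = mix_heads(probs, gate)

# Placeholder teacher: one distribution per head and token position.
teacher = softmax(rng.normal(size=(K, T, V)))
losses = [token_kl(probs[k], teacher[k]) for k in range(K)]
```

Keeping one loss per head, rather than summing all critiques into a single target, mirrors the abstract's point that each persona's feedback pathway stays separate; the router only combines heads at inference time.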
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 146