Distilling Robustness: Mitigating Persona Sensitivity in Language Models via RLVR Teacher-Student Training
Keywords: LLMs, Persona, Prompt, Distillation, Robustness
TLDR: This paper proposes distilling the robustness of an RLVR-trained teacher into a student model to build LLMs that remain stable across varying persona prompts.
Abstract: While persona prompting can boost Large Language Model (LLM) performance, prior work shows that finding the optimal persona is notoriously difficult, with persona effects often being unpredictable and even detrimental. Existing research has attempted to mitigate this volatility with prompt-level interventions, such as costly inference-time ensembles that select the best output from a sensitive model. We argue for a paradigm shift: instead of searching for the best prompt for a volatile model, we should build an inherently robust model that is insensitive to persona variations. We hypothesize that persona sensitivity is not a random phenomenon but a systematic outcome of a model's alignment process. Our analysis reveals that models aligned with Reinforcement Learning with Verifiable Rewards (RLVR) are highly robust, unlike their sensitive, preference-optimized counterparts. Based on this insight, we propose knowledge distillation as a model-centric technique for transplanting this robustness from an RLVR teacher to a student model. Experiments on mathematical reasoning benchmarks show that the distilled model inherits the teacher's robustness, drastically reducing performance gaps across personas. Specifically, for the Qwen3 family, the average persona stability score increased by 0.23 on MATH500 and 0.32 on AIME2024, while for the Llama-8B models, the improvement reached 0.24 and 0.72, respectively.
Submission Number: 33
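A minimal sketch of the teacher-student setup described in the abstract, assuming sequence-level knowledge distillation: the RLVR-aligned teacher generates solutions under persona-varied prompts, and the student is fine-tuned to imitate them. The checkpoint names, persona strings, and hyperparameters below are illustrative placeholders, not the paper's actual configuration, and the paper's exact distillation objective may differ.

# Sketch: sequence-level distillation from an RLVR-trained teacher to a student.
# Model names, personas, and hyperparameters are illustrative assumptions only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "rlvr-teacher"   # hypothetical RLVR-aligned teacher checkpoint
student_name = "student-base"   # hypothetical student checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumes teacher and student share a tokenizer; otherwise decode the
# teacher output and re-tokenize it with the student's tokenizer.
tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name).to(device).eval()
student = AutoModelForCausalLM.from_pretrained(student_name).to(device).train()
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

personas = [
    "You are a meticulous mathematician.",
    "You are a high-school student who loves puzzles.",
    "",  # no persona
]
problems = ["If 3x + 5 = 20, what is x?"]  # stand-in for MATH500 / AIME items

for problem in problems:
    for persona in personas:
        prompt = f"{persona}\n{problem}" if persona else problem
        inputs = tok(prompt, return_tensors="pt").to(device)

        # 1) The teacher generates a solution under the persona-varied prompt.
        with torch.no_grad():
            out = teacher.generate(**inputs, max_new_tokens=256, do_sample=False)

        # 2) The student is fine-tuned with cross-entropy on the teacher's
        #    solution tokens (prompt tokens masked out), so it imitates the
        #    teacher's behavior consistently across personas.
        labels = out.clone()
        labels[:, : inputs["input_ids"].shape[1]] = -100  # ignore prompt tokens
        loss = student(input_ids=out, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

In practice, one would sweep every training problem over a pool of persona prompts so that the student sees the teacher's stable behavior under prompt variation; persona stability can then be measured as the accuracy gap between the best- and worst-performing personas on the held-out benchmarks.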