Keywords: Reinforcement Learning, Federated Learning, Multi-Agent Reinforcement Learning, Personalization, Personalized Adaptation
Abstract: Personalizing Large Language Models (LLMs) requires capturing user preferences without centralizing private data, which motivates a multi-agent local fine-tuning setup. While on-policy algorithms, as applied in RLHF, are well suited to preference modeling, their use remains fundamentally single-agent. We present Peer-Referenced Policy Optimization (PRPO), an online policy-gradient method that lets privacy-constrained clients cooperate while keeping trajectories local. PRPO extends the Proximal Policy Optimization (PPO) family and treats the KL regularizer as a communication channel: in each round, every client conditions its update on a composite reference policy built by peer-to-peer averaging of action distributions. This distribution-level exchange preserves trust-region stability and adds only modest overhead, compatible with LoRA adapters. We provide theoretical support for PRPO through general observations and convergence guarantees under restricted conditions. We evaluate PRPO on Atari and MiniGrid benchmarks and in the standard RLHF summarization setting, where it surpasses local PPO, indicating that reference-policy sharing offers a practical path to scalable, privacy-preserving LLM personalization.
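A minimal sketch of the mechanism the abstract describes, assuming a standard PPO clipped surrogate plus a KL penalty toward a peer-averaged reference policy; the exact PRPO objective is not given here, and all names (`composite_reference`, `prpo_loss`, `kl_coef`, `clip_eps`) are illustrative, not from the paper:

```python
# Hypothetical sketch: PPO clipped surrogate with a KL penalty toward a
# composite reference policy obtained by averaging peers' action distributions.
import torch


def composite_reference(peer_probs: list) -> torch.Tensor:
    """Average peers' per-state action distributions into one reference policy."""
    ref = torch.stack(peer_probs, dim=0).mean(dim=0)    # (batch, n_actions)
    return ref / ref.sum(dim=-1, keepdim=True)          # renormalize for safety


def prpo_loss(new_logits, old_logprobs, actions, advantages,
              peer_probs, clip_eps=0.2, kl_coef=0.05):
    """Clipped surrogate on local trajectories + KL(current || peer-averaged reference)."""
    dist = torch.distributions.Categorical(logits=new_logits)
    new_logprobs = dist.log_prob(actions)

    # Standard PPO clipped surrogate (trajectories stay local to the client).
    ratio = torch.exp(new_logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages).mean()

    # KL regularizer toward the composite reference built from peers' distributions:
    # the only information exchanged is distribution-level, not raw data.
    ref_dist = torch.distributions.Categorical(probs=composite_reference(peer_probs))
    kl = torch.distributions.kl_divergence(dist, ref_dist).mean()

    return -(surrogate - kl_coef * kl)


# Toy usage with random tensors (2 peers, batch of 4, 6 discrete actions).
if __name__ == "__main__":
    B, A = 4, 6
    new_logits = torch.randn(B, A, requires_grad=True)
    old_logprobs = torch.randn(B)
    actions = torch.randint(0, A, (B,))
    advantages = torch.randn(B)
    peers = [torch.softmax(torch.randn(B, A), dim=-1) for _ in range(2)]
    loss = prpo_loss(new_logits, old_logprobs, actions, advantages, peers)
    loss.backward()
    print(float(loss))
```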
Primary Area: reinforcement learning
Submission Number: 12247