Keywords: Reinforcement Learning, Federated Learning, Multi-Agent Reinforcement Learning, Personalization, Personalized Adaptation
Abstract: Personalizing Large Language Models (LLMs) requires capturing user preferences without centralizing private data, which motivates a multi-agent local fine-tuning setup. While on-policy algorithms, as applied in RLHF, are well suited to preference modeling, their use remains fundamentally single-agent. We present Peer-Referenced Policy Optimization (PRPO), an online policy-gradient method that lets privacy-constrained clients cooperate while keeping trajectories local. PRPO extends the Proximal Policy Optimization (PPO) family and treats the KL regularizer as a communication channel: in each round, every client conditions its update on a composite reference policy built by peer-to-peer averaging of action distributions. This distribution-level exchange preserves trust-region stability and adds only modest overhead, compatible with LoRA adapters. We provide theoretical support for PRPO through general observations and convergence guarantees under restricted conditions. We evaluate PRPO on Atari and MiniGrid benchmarks and in the standard RLHF summarization setting, where it surpasses local PPO, indicating that reference-policy sharing offers a practical path to scalable, privacy-preserving LLM personalization.
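A minimal sketch of the mechanism the abstract describes, assuming a standard PPO clipped surrogate plus a KL penalty toward a peer-averaged reference policy; the exact PRPO objective is not given here, and all names (`composite_reference`, `prpo_loss`, `kl_coef`, `clip_eps`) are illustrative, not from the paper:

```python
# Hypothetical sketch: PPO clipped surrogate with a KL penalty toward a
# composite reference policy obtained by averaging peers' action distributions.
import torch


def composite_reference(peer_probs: list) -> torch.Tensor:
    """Average peers' per-state action distributions into one reference policy."""
    ref = torch.stack(peer_probs, dim=0).mean(dim=0)    # (batch, n_actions)
    return ref / ref.sum(dim=-1, keepdim=True)          # renormalize for safety


def prpo_loss(new_logits, old_logprobs, actions, advantages,
              peer_probs, clip_eps=0.2, kl_coef=0.05):
    """Clipped surrogate on local trajectories + KL(current || peer-averaged reference)."""
    dist = torch.distributions.Categorical(logits=new_logits)
    new_logprobs = dist.log_prob(actions)

    # Standard PPO clipped surrogate (trajectories stay local to the client).
    ratio = torch.exp(new_logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages).mean()

    # KL regularizer toward the composite reference built from peers' distributions:
    # the only information exchanged is distribution-level, not raw data.
    ref_dist = torch.distributions.Categorical(probs=composite_reference(peer_probs))
    kl = torch.distributions.kl_divergence(dist, ref_dist).mean()

    return -(surrogate - kl_coef * kl)


# Toy usage with random tensors (2 peers, batch of 4, 6 discrete actions).
if __name__ == "__main__":
    B, A = 4, 6
    new_logits = torch.randn(B, A, requires_grad=True)
    old_logprobs = torch.randn(B)
    actions = torch.randint(0, A, (B,))
    advantages = torch.randn(B)
    peers = [torch.softmax(torch.randn(B, A), dim=-1) for _ in range(2)]
    loss = prpo_loss(new_logits, old_logprobs, actions, advantages, peers)
    loss.backward()
    print(float(loss))
```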
Primary Area: reinforcement learning
Submission Number: 12247