Offline Federated Deep Reinforcement Learning with Awareness of Expected Returns and Policy Inconsistency

16 Sept 2025 (modified: 02 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Federated Deep Reinforcement Learning; Offline Deep Reinforcement Learning
TL;DR: This paper proposes an offline federated deep reinforcement learning framework that evaluates the capabilities of client models and the global model by combining policy inconsistency and expected return.
Abstract: Offline Federated Deep Reinforcement Learning (FDRL) methods aggregate multiple client-side offline Deep Reinforcement Learning (DRL) models, each trained locally, to enable knowledge sharing while preserving privacy. Existing offline FDRL methods assign client weights during global aggregation using either simple averaging or Q-values, but they do not consider Q-values and policy inconsistency jointly; the latter reflects the distributional discrepancy between the learned policy and the behavior policy underlying the offline data. As a result, a client that has no significant advantage in one aspect but a clear disadvantage in the other can disproportionately influence the global model, degrading its capability in the deficient aspect. In addition, during local training existing methods compel clients to fully adopt the global model, which harms clients when the global model is weak. To address these limitations, we propose a novel federated learning framework that can be seamlessly integrated into current offline FDRL approaches to improve their performance. Our method considers both policy inconsistency and Q-values when determining the weights of client models, with the Q-values rescaled by a scaling factor so that their magnitude matches that of the policy inconsistency. The aggregated global model is then distributed to clients, which minimize the discrepancy between their models and the global one. The weight of this discrepancy term is reduced when a client model's capability exceeds that of the global model, mitigating the influence of a weaker global model. Experiments on the Datasets for Deep Data-Driven Reinforcement Learning (D4RL) benchmark demonstrate that our method improves four state-of-the-art (SOTA) offline FDRL methods in terms of return and D4RL score.
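The abstract only states that aggregation weights combine rescaled Q-values with policy inconsistency and that the client-side discrepancy term is attenuated when a client outperforms the global model; it does not give the formulas. The sketch below is a minimal illustration under assumed functional forms: the scaling factor is taken as a ratio of mean magnitudes, the combination rule is a softmax over (scaled Q minus inconsistency), and the attenuation is a capped score ratio with a hypothetical coefficient `mu`. The helper names (`aggregation_weights`, `aggregate`, `proximal_penalty`) are illustrative, not the paper's API.

```python
import numpy as np

def aggregation_weights(q_values, inconsistencies, eps=1e-8):
    """Combine per-client expected returns (Q-values) and policy
    inconsistencies into aggregation weights.

    Assumed form: rescale Q to the magnitude of the inconsistency term,
    then softmax over (scaled Q - inconsistency). Higher return and lower
    inconsistency both increase a client's weight.
    """
    q = np.asarray(q_values, dtype=float)
    d = np.asarray(inconsistencies, dtype=float)
    # Hypothetical scaling factor: match mean magnitudes so neither term dominates.
    scale = (np.abs(d).mean() + eps) / (np.abs(q).mean() + eps)
    score = scale * q - d
    score -= score.max()              # numerical stability for the softmax
    w = np.exp(score)
    return w / w.sum()

def aggregate(client_params, weights):
    """Weighted average of flattened client parameter vectors (FedAvg-style)."""
    stacked = np.stack(client_params)             # (num_clients, num_params)
    return (weights[:, None] * stacked).sum(axis=0)

def proximal_penalty(local_params, global_params, local_score, global_score, mu=1.0):
    """Client-side discrepancy term toward the global model.

    Assumed attenuation: the penalty is scaled down by min(1, global/local
    capability ratio), so a weak global model drags a strong client less.
    """
    atten = min(1.0, global_score / max(local_score, 1e-8))
    diff = np.asarray(local_params) - np.asarray(global_params)
    return mu * atten * float(diff @ diff)

if __name__ == "__main__":
    # Toy example with three clients and 4-dimensional parameter vectors.
    q = [120.0, 80.0, 95.0]            # per-client expected returns
    d = [0.4, 0.1, 0.9]                # per-client policy inconsistencies
    params = [np.random.randn(4) for _ in range(3)]
    w = aggregation_weights(q, d)
    global_params = aggregate(params, w)
    print(w, proximal_penalty(params[0], global_params, local_score=1.2, global_score=1.0))
```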
Primary Area: reinforcement learning
Submission Number: 6666