Sample-Efficient Preference-Based Reinforcement Learning Using Diffusion Models

Published: 01 Apr 2025, Last Modified: 29 Apr 2025, Venue: ALA, License: CC BY 4.0
Keywords: Reinforcement Learning, Preference-based Reinforcement Learning, Diffusion Models
Abstract: Preference-based reinforcement learning (RL) has shown great potential for scaling human feedback to deep RL settings, enabling agents to solve complex tasks without access to a pre-defined reward function. Many state-of-the-art preference-based RL methods use off-policy learning to allow the agent to reuse previously collected experiences to improve learning efficiency. However, passively reusing prior data can limit the generality since the dataset of recent experiences is typically limited and not sufficiently diverse. This data limitation problem, on the other hand, has recently been addressed by using diffusion generative models to upsample agent experiences in the context of offline and online RL. Inspired by this success, we introduce PRIDE: Preference-based Reinforcement learning using dIffusion moDEl, a novel approach that integrates diffusion models into preference-based RL to improve both sample and feedback efficiency. PRIDE continually trains a diffusion model to approximate the RL agent’s online behavioral distribution. The trained diffusion model then generates a large quantity of novel and diverse synthetic experiences, which are used to augment limited real data, enabling better generalization while reducing reliance on real data. We evaluate PRIDE on a variety of locomotion and robotic manipulation tasks. Empirical results demonstrate that PRIDE outperforms state-of-the-art preference-based RL method in most tasks tested and achieves comparable or superior performance with a 50% reduction in human feedback. The novel use of diffusion models in our approach presents a promising direction for improving sample and feedback efficiency in preference-based RL.
Type Of Paper: Full paper (max 8 pages)
Anonymous Submission: Anonymized submission.
Submission Number: 2