Abstract: Offline reinforcement learning (RL) recovers the optimal policy $\pi$ from historical observations of an agent. In practice, $\pi$ is modeled as a weighted version of the agent's behavior policy $\mu$, with a weight function $w$ acting as a critic of the agent's behavior. Although recent approaches to offline RL based on diffusion models (DMs) have exhibited promising results, they require training a separate guidance network to compute the required scores, which is challenging due to their dependence on the unknown $w$. In this work, we construct a diffusion over both the actions and the weights in order to explore a more streamlined DM-based approach to offline RL. In the proposed setting, the required scores are obtained directly from the diffusion model, without learning additional networks. Our main conceptual contribution is a novel exact guidance method in which the guidance comes from the same diffusion model; accordingly, our proposal is termed Self-Weighted Guidance (SWG). Through an experimental proof of concept for SWG, we show that the proposed method i) generates samples from the desired distribution on toy examples, ii) performs competitively against state-of-the-art methods on D4RL when using resampling, and iii) exhibits robustness and scalability via ablation studies.
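As a reading aid (not taken from the paper), the setup described in the abstract can be sketched as follows, assuming the standard multiplicative weighting of the behavior policy and a classifier-guidance-style score decomposition, with $\pi_t$, $\mu_t$, $w_t$ denoting the diffused counterparts at noise level $t$:

$$\pi(a \mid s) \;\propto\; \mu(a \mid s)\, w(s, a),$$

$$\nabla_{a} \log \pi_t(a \mid s) \;=\; \nabla_{a} \log \mu_t(a \mid s) \;+\; \nabla_{a} \log w_t(s, a).$$

Under this reading, the second term is what prior DM-based methods approximate with a separately trained guidance network, whereas SWG obtains both terms from a single diffusion model trained jointly over actions and weights.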
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: The updated manuscript includes the following changes:
1) Improved experiments using resampling within the proposed SWG. These experiments now position our proposal among the top two performing methods on all but one of the considered datasets (Table 3).
2) A new diagram (Figure 1) to complement the presentation of our approach.
3) A revision of the text for clarity, as suggested by the reviewers.
4) A sharper presentation of the paper's aims, emphasizing the exploratory component of our work and its competitive results.
The text has been completely proofread to ensure clarity.
Changes in blue correspond to comments by reviewers Su2u and Upva, while changes in orange correspond to reviewer rDyL. This is because the feedback from reviewer rDyL pertained to the version that incorporated the changes suggested by reviewers Su2u and Upva.
Assigned Action Editor: ~Stefan_Lee1
Submission Number: 5708