Keywords: Reinforcement Learning, Offline Reinforcement Learning, Value Regularisation
Abstract: In this paper, we propose a new framework for value regularisation in offline reinforcement learning (RL). While most previous methods avoid explicit out-of-distribution (OOD) region identification due to its difficulty, our method explicitly identifies the OOD region, which can be non-convex depending on the dataset, via a newly proposed trajectory clustering-based behaviour cloning algorithm. Given the explicit OOD region, we then define a Bellman-type operator that pushes the value in the OOD region to a tight lower bound while operating normally in the in-distribution region. The value function learned with this operator can be used for policy acquisition in various ways. Empirical results on multiple offline RL benchmarks show that our method achieves state-of-the-art performance.
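As a minimal sketch of the kind of operator the abstract describes (the exact definition is given in the paper; the symbols $\mathcal{T}_{\mathrm{reg}}$, $\Omega_{\mathrm{OOD}}$, and $\underline{V}$ below are illustrative placeholders, not notation taken from the submission), one could write
\[
(\mathcal{T}_{\mathrm{reg}} Q)(s,a) \;=\;
\begin{cases}
r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\!\left[\max_{a'} Q(s',a')\right], & (s,a) \notin \Omega_{\mathrm{OOD}},\\[4pt]
\underline{V}, & (s,a) \in \Omega_{\mathrm{OOD}},
\end{cases}
\]
where $\Omega_{\mathrm{OOD}}$ denotes the explicitly identified OOD region and $\underline{V}$ is a tight lower bound on the value. In the in-distribution region the operator behaves like the standard Bellman backup, while OOD state-action pairs are pinned to the lower bound rather than conservatively penalised everywhere.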
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 18463