Abstract: In this paper, we propose Offline Reinforcement Learning through Trajectory Clustering and Exclusive Regularisation (TraCER), a value regularisation framework that accounts for out-of-distribution (OOD) actions. Unlike most existing methods, which avoid reasoning directly about OOD regions because of their inherent difficulty, TraCER traces and delineates the potentially non-convex OOD regions of the action space using a behaviour cloning algorithm based on trajectory clustering. This approach assumes that each trajectory in the offline dataset was rolled out by a single behaviour policy, an assumption commonly satisfied in practice when datasets are collected from distinct sources or agents. Conditioned on this delineation, we introduce a Bellman-type operator that constrains value estimates for OOD actions to a tight lower bound while leaving in-distribution action-value estimates unchanged. The resulting value function supports standard policy extraction procedures. Experiments on multiple offline RL benchmarks demonstrate that TraCER consistently outperforms existing approaches.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Alberto_Bietti1
Submission Number: 9256