An Investigation of Offline Reinforcement Learning in Factorisable Action Spaces

Published: 17 Nov 2024, Last Modified: 17 Nov 2024
Accepted by TMLR
License: CC BY 4.0
Abstract: Expanding reinforcement learning (RL) to offline domains offers promising prospects, particularly in sectors where data collection poses substantial challenges or risks. Pivotal to the success of transferring RL offline is mitigating overestimation bias in value estimates for state-action pairs absent from the data. Whilst numerous approaches have been proposed in recent years, these tend to focus primarily on continuous or small-scale discrete action spaces. Factorised discrete action spaces, on the other hand, have received relatively little attention, despite many real-world problems naturally having factorisable actions. In this work, we undertake a formative investigation into offline reinforcement learning in factorisable action spaces. Using value decomposition as formulated in DecQN as a foundation, we present the case for a factorised approach and conduct an extensive empirical evaluation of several offline techniques adapted to the factorised setting. In the absence of established benchmarks, we introduce a suite of our own, comprising datasets of varying quality and task complexity. Advocating for reproducible research and innovation, we make all datasets available for public use alongside our code base: https://github.com/AlexBeesonWarwick/OfflineRLFactorisableActionSpaces
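For readers unfamiliar with DecQN-style value decomposition, the following is a minimal, hypothetical sketch of the idea the abstract refers to: the Q-value of a factorised action is approximated by the mean of per-sub-action utilities, so greedy selection reduces to an independent argmax per dimension. All names and numbers here are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
import numpy as np

rng = np.random.default_rng(0)

n_subactions = 3   # N sub-action dimensions (hypothetical)
n_choices = 4      # choices per dimension (hypothetical)

# Hypothetical utility table U[i, a_i] for a single fixed state.
U = rng.normal(size=(n_subactions, n_choices))

def decomposed_q(action):
    """Q(s, a) approximated as the mean of per-dimension utilities U_i(s, a_i)."""
    return np.mean([U[i, a_i] for i, a_i in enumerate(action)])

# Greedy selection decomposes into an argmax per dimension, avoiding
# enumeration of all n_choices ** n_subactions atomic actions.
greedy_action = U.argmax(axis=1)
print(greedy_action, decomposed_q(greedy_action))
```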
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=fm679EfNqc
Changes Since Last Submission: We would again like to extend our gratitude and appreciation to the previous Reviewers and Action Editor for taking the time to provide valuable feedback on our work. We have used this feedback to produce a revised version of our manuscript, which we hope addresses the two primary deficiencies of the previous submission. We detail these revisions below.

The first main issue highlighted by reviewers was a lack of technical clarity, in particular in our analysis of Q-value/utility-value errors arising from function approximation. Following one reviewer's comments, we have re-framed our analysis in the context of Q-learning, bringing it more in line with previous work by Thrun & Schwartz (1993) and Ireland & Montana (2023). Our analysis now focuses on errors in target differences under DQN and DecQN, and we argue that such errors can be reduced by moving from the standard atomic representation of actions to one based on factorisation and decomposition. We explain why this is particularly beneficial in the offline setting and provide supporting simulations in the Appendix. We also explicitly highlight the limitations of our analysis, in particular that we retain the assumption of previous work by Ireland & Montana (2023) that the true Q-function decomposes additively (which one reviewer found overly restrictive).

The second main issue with the previous submission was the lack of a naturally factorisable environment/task in our benchmarking suite and subsequent experimental evaluations. To address this, we have added the factorisable Maze environment from previous work by Chandak et al. (2019) to our suite and set of experiments.

We have also made minor formatting changes, condensing the main body to 12 pages and moving related material to the Appendix.
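To make the target-difference comparison mentioned above concrete, the sketch below contrasts the bootstrapped target of atomic DQN (a maximisation over every joint action) with a DecQN-style target (an independent maximisation per utility dimension), under the additive-decomposition assumption noted in the revision summary. This is an illustrative toy with made-up quantities, not the paper's analysis code.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
N, n_choices, gamma, reward = 3, 4, 0.99, 1.0   # hypothetical sizes and values

# Hypothetical next-state utilities U_i(s', a_i) for a single transition.
U_next = rng.normal(size=(N, n_choices))

# Atomic view: enumerate every joint action, assuming Q decomposes additively.
Q_next = {a: np.mean([U_next[i, a[i]] for i in range(N)])
          for a in itertools.product(range(n_choices), repeat=N)}

dqn_target = reward + gamma * max(Q_next.values())          # max over 4**3 joint actions
decqn_target = reward + gamma * U_next.max(axis=1).mean()   # per-dimension max, then mean

print(dqn_target, decqn_target)  # equal here because the true Q is assumed additive
```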
Code: https://github.com/AlexBeesonWarwick/OfflineRLFactorisableActionSpaces
Assigned Action Editor: ~Marc_Lanctot1
Submission Number: 3242