Evaluating In-Sample Softmax in Offline Reinforcement Learning: An Analysis Across Diverse Environments

TMLR Paper 2256 Authors

17 Feb 2024 (modified: 01 Mar 2024) · Under review for TMLR
Abstract: In this work, we consider the problem of learning action-values and corresponding policies from a fixed batch of data. Algorithms designed for this setting must account for the fact that the action coverage of the data distribution may be incomplete, that is, certain state-action transitions are not present in the dataset. This insufficient action coverage is the core issue facing offline RL methods: it leads to overestimation or divergence during the bootstrapping update. We critically examine the In-Sample Softmax (INAC) algorithm for Offline Reinforcement Learning (RL), which addresses the challenge of learning effective policies from pre-collected data, without further environmental interaction, by using an in-sample softmax. Through extensive analysis and comparison with related algorithms such as In-Sample Actor-Critic (IAC) and Batch-Constrained Q-learning (BCQ), we investigate INAC's efficacy across various environments, including tabular, continuous, and discrete domains, as well as imbalanced datasets. We find that INAC, when benchmarked against state-of-the-art offline RL algorithms, is robust to variations in the data distribution and performs comparably, if not better, in all scenarios. We provide a comprehensive evaluation of the capabilities and limitations of the in-sample softmax method within the broader context of offline reinforcement learning.
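To make the failure mode concrete, the following is a minimal tabular sketch (not the paper's implementation; `q_next`, `in_sample_mask`, and the temperature `tau` are illustrative assumptions) contrasting a standard max backup, which can bootstrap from actions the dataset never covers, with a softmax backup restricted to in-sample actions, in the spirit of the in-sample softmax idea.

```python
# Simplified illustration: out-of-sample bootstrapping vs. an in-sample softmax backup.
import numpy as np

def max_backup(q_next, reward, gamma):
    # Standard Q-learning target: max over ALL actions, including actions the
    # dataset never covers, whose Q-value estimates may be arbitrarily inflated.
    return reward + gamma * np.max(q_next)

def in_sample_softmax_backup(q_next, in_sample_mask, reward, gamma, tau=0.1):
    # Soft maximum (log-sum-exp at temperature tau) computed only over actions
    # that actually appear in the dataset at the next state.
    q_in = q_next[in_sample_mask]
    soft_value = tau * np.log(np.sum(np.exp(q_in / tau)))
    return reward + gamma * soft_value

# Hypothetical next-state Q-values; action 2 was never observed in the data
# and its estimate has diverged upward.
q_next = np.array([1.0, 0.8, 50.0])
in_sample_mask = np.array([True, True, False])

print(max_backup(q_next, reward=0.0, gamma=0.99))                    # inflated target
print(in_sample_softmax_backup(q_next, in_sample_mask, 0.0, 0.99))   # bounded by in-sample values
```

The in-sample softmax as studied in the paper additionally weights actions by the behavior policy; the sketch above only conveys why restricting the backup to in-sample actions avoids bootstrapping from overestimated, uncovered actions.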
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Yu_Bai1
Submission Number: 2256