Offline Reinforcement Learning via Tsallis Regularization

Lingwei Zhu; Matthew Kyle Schlegel; Han Wang; Martha White

Published: 06 May 2024, Last Modified: 17 Sept 2024. Accepted by TMLR. License: CC BY 4.0

Abstract: Offline reinforcement learning (RL) focuses on learning a good policy from a fixed dataset. The dataset is generated by an unknown behavior policy through interactions with the environment and covers only a subset of the state-action space. Standard off-policy algorithms often perform poorly in this setting, suffering from erroneously optimistic values incurred by out-of-distribution (OOD) actions not present in the dataset. The optimism cannot be corrected, as no further interaction with the environment is possible. Imposing divergence regularization and in-sample constraints are among the most popular methods for overcoming the issue: they keep the learned policy close to the behavior policy to minimize the occurrence of OOD actions. This paper proposes Tsallis regularization for offline RL, which aligns the induced sparsemax policies with the in-sample constraint. Sparsemax interpolates between existing methods that use hard-max and softmax policies, in that only a subset of actions receives non-zero probability, as compared to softmax (all actions) and hard-max (a single action). We leverage this property to model the behavior policy and show that under several assumptions the learned sparsemax policies may have sparsity-conditional KL divergence to the behavior policy, making Tsallis regularization especially suitable for Behavior Cloning methods. We propose a novel actor-critic algorithm, Tsallis Advantage Weighted Actor-Critic (Tsallis AWAC), which generalizes AWAC, and analyze its performance in standard MuJoCo environments. Our code is available at https://github.com/lingweizhu/tsallis_regularization.
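
The interpolation described in the abstract can be illustrated with a minimal sketch. It uses the standard simplex-projection form of sparsemax and is not taken from the paper's repository; the function names and the action values are hypothetical and chosen only for illustration.

```python
import math

def softmax(values):
    # Every action receives strictly positive probability.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def hardmax(values):
    # All probability mass goes to a single greedy action.
    best = max(range(len(values)), key=lambda i: values[i])
    return [1.0 if i == best else 0.0 for i in range(len(values))]

def sparsemax(values):
    # Euclidean projection of the scores onto the probability simplex.
    # Actions whose score falls below the threshold tau get exactly zero mass.
    z_sorted = sorted(values, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, z in enumerate(z_sorted, start=1):
        cumsum += z
        # The support condition holds for the k largest scores and fails
        # afterwards, so the last k that satisfies it defines tau.
        if 1.0 + k * z > cumsum:
            tau = (cumsum - 1.0) / k
    return [max(v - tau, 0.0) for v in values]

# Hypothetical action values for one state (illustration only).
q_values = [2.0, 1.5, 0.1, -1.0]
print(hardmax(q_values))    # [1.0, 0.0, 0.0, 0.0]
print(sparsemax(q_values))  # [0.75, 0.25, 0.0, 0.0]
print(softmax(q_values))    # four strictly positive probabilities
```

With these values, sparsemax keeps two of the four actions in its support, whereas hard-max keeps one and softmax keeps all four. Truncating low-value actions in this way is the property the abstract relates to the in-sample constraint.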

Submission Length: Regular submission (no more than 12 pages of main content)

Changes Since Last Submission: We thank all reviewers and the action editor for their efforts and helpful feedback. We have uploaded our polished camera-ready version and open-sourced our code at https://github.com/lingweizhu/tsallis_regularization.

Code: https://github.com/lingweizhu/tsallis_regularization

Supplementary Material: zip

Assigned Action Editor: ~Marc_Lanctot1

Submission Number: 2035
