Abstract: Offline reinforcement learning (RL) aims to learn an effective policy from a static dataset.
To alleviate extrapolation errors, existing studies often uniformly regularize the value function or policy updates across all states.
However, due to substantial variations in data quality, a fixed regularization strength often leads to a dilemma:
Weak regularization fails to curb extrapolation errors and value overestimation, while strong regularization shifts policy learning toward behavior cloning, limiting the performance gains enabled by Bellman updates.
To address this issue, we propose a selective state-adaptive regularization method for offline RL. Specifically, we introduce state-adaptive regularization coefficients that trust Bellman-driven results at the state level, while selectively applying regularization only to high-quality actions, thereby avoiding the performance degradation caused by tightly constraining the policy to low-quality actions.
By establishing a connection between CQL, a representative value regularization method, and explicit policy constraint methods, we extend selective state-adaptive regularization to both of these mainstream offline RL approaches.
Extensive experiments demonstrate that the proposed method significantly outperforms the state-of-the-art approaches in both offline and offline-to-online settings on the D4RL benchmark. The implementation is available at https://github.com/QinwenLuo/SSAR.
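The abstract does not spell out the loss; as a rough illustration of the idea only, the sketch below shows a TD3+BC-style actor objective in which a hypothetical per-state network (alpha_net) supplies the regularization coefficient and the behavior-cloning penalty is applied only to dataset actions whose estimated advantage clears a threshold (adv_threshold). All names and thresholds here are illustrative assumptions, not the paper's API; see the linked repository for the actual implementation.

```python
import torch
import torch.nn as nn

# Toy networks standing in for the actor, critic, and per-state coefficient model (assumed shapes).
state_dim, action_dim = 4, 2
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
alpha_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Softplus())

def q(states, actions):
    # Critic value of a state-action pair.
    return critic(torch.cat([states, actions], dim=-1))

def actor_loss(states, actions, adv_threshold=0.0):
    pi = actor(states)                                    # policy actions
    q_pi = q(states, pi)                                  # value of policy actions
    rl_loss = -q_pi.mean()                                # Bellman-driven objective

    with torch.no_grad():
        adv = q(states, actions) - q_pi                   # advantage of dataset action over policy action
        mask = (adv > adv_threshold).float()              # regularize only toward "high-quality" actions

    alpha = alpha_net(states)                             # state-adaptive regularization strength
    bc = ((pi - actions) ** 2).sum(dim=-1, keepdim=True)  # behavior-cloning penalty
    return rl_loss + (alpha * mask * bc).mean()

# Example batch
states = torch.randn(8, state_dim)
actions = torch.rand(8, action_dim) * 2 - 1
loss = actor_loss(states, actions)
loss.backward()
```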
Lay Summary: AI agents can learn from past experience, but when trained only on existing data—a setting known as offline reinforcement learning—they often make mistakes due to overly optimistic decisions. Previous methods attempted to prevent such errors by enforcing strong constraints that keep the agent's behavior close to the data. However, applying the same level of constraint everywhere either hinders learning or exposes the agent to unnecessary risks.
Our research introduces an adaptive strategy that adjusts the strength of constraints based on how much the agent's behavior deviates from high-quality data. This allows the agent to benefit more from its own learning process. At the same time, we apply constraints only to a carefully selected subset of reliable data, helping the agent take full advantage of the reinforcement learning paradigm.
The proposed method can integrate seamlessly with widely used training techniques and enable a smooth transition from offline learning to online interaction.
Link To Code: https://github.com/QinwenLuo/SSAR
Primary Area: Reinforcement Learning->Batch/Offline
Keywords: offline reinforcement learning, selective state-adaptive regularization, offline-to-online reinforcement learning
Submission Number: 10490