Abstract: Offline reinforcement learning is widely applied across many fields because of its advantages in efficiency and risk control. However, a major problem it faces is the distribution shift between offline datasets and online environments. This mismatch gives rise to out-of-distribution (OOD) state-action pairs that fall outside the scope of the training data, so policies trained with existing conservative methods may not provide reliable decisions when the test environment deviates substantially from the offline dataset. In this paper, we propose Test-time Adapted Reinforcement Learning (TARL) to address this problem. TARL constructs unsupervised test-time optimization objectives for discrete and continuous control tasks using test data, without depending on environmental rewards. For discrete control tasks, it minimizes the entropy of the predicted action probabilities to reduce uncertainty and avoid OOD state-action pairs. For continuous control tasks, it represents action uncertainty via the normal distribution output by the policy network and minimizes it. Moreover, to prevent model bias caused by overfitting and error accumulation during test-time updates, TARL enforces a KL-divergence constraint between the fine-tuned policy and the original policy. For efficiency, TARL updates only the layer normalization parameters during testing. Extensive experiments on popular Atari game benchmarks and the D4RL dataset demonstrate the superiority of our method; for example, TARL achieves a 13.6% relative increase in episode return over CQL on the hopper-expert-v2 task.
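To make the test-time objective concrete, the PyTorch sketch below illustrates the discrete-control case as described in the abstract: entropy minimization of the predicted action distribution on test states, a KL constraint to the frozen original policy, and adaptation of only the layer-normalization parameters. This is a minimal sketch under stated assumptions, not the authors' implementation; names such as `tarl_test_time_step`, `layernorm_params`, and `kl_weight` are illustrative, and the policy network is assumed to map states to action logits.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


def layernorm_params(model: nn.Module):
    # Only LayerNorm affine parameters are adapted at test time (assumed setup).
    return [p for m in model.modules() if isinstance(m, nn.LayerNorm)
            for p in m.parameters()]


def tarl_test_time_step(policy, frozen_policy, states, optimizer, kl_weight=1.0):
    """One unsupervised test-time update on a batch of test states.

    Minimizes the entropy of the predicted action probabilities (uncertainty)
    while penalizing KL divergence from the original, frozen policy.
    """
    logits = policy(states)                         # assumed: states -> action logits
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()

    # Entropy of the predicted action distribution (to be minimized).
    entropy = -(probs * log_probs).sum(dim=-1).mean()

    # KL(adapted || original) keeps the fine-tuned policy close to the offline policy.
    with torch.no_grad():
        ref_log_probs = F.log_softmax(frozen_policy(states), dim=-1)
    kl = (probs * (log_probs - ref_log_probs)).sum(dim=-1).mean()

    loss = entropy + kl_weight * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Usage sketch: keep a frozen copy of the offline policy and adapt only LayerNorm weights.
# policy = ...  # pretrained offline policy (hypothetical)
# frozen_policy = copy.deepcopy(policy).eval()
# optimizer = torch.optim.Adam(layernorm_params(policy), lr=1e-4)
# for states in test_stream:
#     tarl_test_time_step(policy, frozen_policy, states, optimizer)
```

For continuous control, the same structure would apply with the entropy of the Gaussian output by the policy network in place of the categorical entropy.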
Lay Summary: Offline reinforcement learning helps AI systems learn from pre-collected datasets (like past robot movements or game strategies) safely and efficiently. But when the real-world environment changes — like a robot encountering new obstacles or a game adding unexpected rules — the AI often struggles because it hasn’t seen these scenarios before. This mismatch can lead to unreliable or unsafe decisions.
We designed a method called Test-time Adapted Reinforcement Learning (TARL) that lets the AI adjust itself during real-world use without needing explicit feedback. For tasks with clear choices (e.g., game controls), it reduces confusion by picking the most confident action. For complex tasks (e.g., robotic arm movements), it avoids risky moves by narrowing down possible actions. To prevent overcorrection, TARL limits how much the AI can deviate from its original safe training. Crucially, these updates happen efficiently — only tweaking a tiny part of the AI’s "brain" during testing.
TARL improved performance over prior methods in robot tasks and video game benchmarks. By enabling safer and more flexible AI adaptation, it bridges the gap between offline training and real-world challenges without costly retraining.
Primary Area: Reinforcement Learning->Batch/Offline
Keywords: Reinforcement Learning, offline, test-time adaptation
Submission Number: 4519