Value Conditioned Policy Fine Tuning for Test Time Domain Adaptation

Published: 10 Jun 2025, Last Modified: 11 Jul 2025 · PUT at ICML 2025 Poster · CC BY 4.0
Keywords: reinforcement learning, test time adaptation, cross domain
TL;DR: We fine-tune policies under domain shift at test time by conservatively updating the Q-function within a trust region of the pre-trained Q-function, achieving competitive performance with 3.5× faster runtime on MuJoCo.
Abstract: Rapid cross-domain adaptation of learned policies is a key enabler for efficient robot deployment in new environments. Sim-to-real transfer, in particular, remains a core challenge in reinforcement learning (RL) due to the unavoidable difference in world dynamics. Naïve policy updates via fine-tuning are unstable because of noisy gradients under domain shift, while other methods typically learn a new policy from scratch, relying on data from both the source and target domains through selective data sharing or reward shaping. Neither approach is suitable for time-efficient policy adaptation, nor for adaptation without access to an efficient simulator during deployment. In contrast, we propose value-conditioned policy fine-tuning, which leverages the existing Q-function to estimate trust regions for stable policy updates. In practice, this can be achieved simply by combining gradients from the pre-trained and current Q-functions. We conduct extensive experiments on the MuJoCo dynamics adaptation benchmark for online adaptation, demonstrating competitive performance compared to existing state-of-the-art methods with over 3.5× faster runtime.
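The following is a minimal sketch of the idea described in the abstract, assuming a PyTorch-style actor-critic setup: the actor's loss mixes the frozen pre-trained Q-function with the current Q-function, so the combined gradient keeps the fine-tuned policy close to the source-domain solution while still adapting to the target dynamics. The names `policy`, `q_pretrained`, `q_current`, and the mixing weight `beta` are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical sketch, not the authors' implementation: a policy-update loss
# whose gradient is a convex combination of the gradients induced by the
# pre-trained (source-domain) and current (target-domain) Q-functions.
import torch


def value_conditioned_policy_loss(policy, q_pretrained, q_current, states, beta=0.5):
    """Actor loss combining the two critics.

    beta trades off the frozen pre-trained Q-function, which anchors the update
    near the source-domain behavior (a trust-region-like effect), against the
    current Q-function, which is being fine-tuned on target-domain data.
    """
    actions = policy(states)                  # differentiable actions, e.g. tanh-Gaussian mean
    q_old = q_pretrained(states, actions)     # frozen source-domain critic
    q_new = q_current(states, actions)        # critic adapted on target-domain transitions
    # Maximizing the mixture (minimizing its negative) yields a policy gradient
    # that blends the two critics' gradients w.r.t. the policy parameters.
    return -(beta * q_old + (1.0 - beta) * q_new).mean()


# Usage sketch for one fine-tuning step on a batch of target-domain states:
# optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
# loss = value_conditioned_policy_loss(policy, q_pretrained, q_current, batch_states)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```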
Submission Number: 66