Keywords: Offline-to-Online Reinforcement Learning, Offline RL, Online Fine-Tuning, Critic Ensembles, REDQ, Conservative Q-Learning, Safe Online Adaptation, Risk-Aware Reinforcement Learning
TL;DR: FOCUS reveals that offline-to-online RL checkpoints can hide transient high-footprint critic outliers that substantially destabilize online adaptation.
Abstract: Offline-to-online reinforcement learning often transfers a conservatively pretrained critic ensemble intact as the initialization for online fine-tuning. In REDQ-style fine-tuning, however, online Bellman targets are constructed from randomly sampled critic heads, raising a checkpoint-level question: which conservatively pretrained heads should be trusted after transfer? We introduce FOCUS (Footprint-based Offline-to-online Critic Selection), a transition-time audit that scores each critic head by its conservative Bellman footprint and target exposure before online updates begin. Matched seed-0 checkpoint audits on AntMaze medium-diverse and large-diverse reveal that critic-head footprints are not static: heterogeneity ranges from nearly homogeneous to severe-outlier regimes, and the highest-footprint head changes over the course of pretraining. This suggests that offline policy success alone can hide risk in the transferred critic target generator. Reduced-warmup experiments sharpen the lesson: forcing high-footprint heads into REDQ targets can substantially destabilize transfer at a high-heterogeneity checkpoint, while aggressively selecting the low-footprint half of the ensemble yields mixed results. FOCUS therefore exposes a risk-diversity tradeoff in how conservative critic audits should be used, rather than merely proposing another fixed ensemble selector.
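To make the target-construction step concrete, the following is a minimal sketch of how a REDQ-style Bellman target takes the minimum over a randomly sampled subset of critic heads. It is an illustration under stated assumptions, not the submission's implementation: the function name `redq_target`, the `allowed_heads` argument, and the list-of-lists layout of `next_q` are hypothetical, the SAC entropy term is omitted, and the FOCUS footprint score itself is not reproduced because the abstract does not define it.

```python
import random

def redq_target(next_q, reward, done, rng, subset_size=2, gamma=0.99,
                allowed_heads=None):
    """Sketch of a REDQ-style target (entropy term omitted for brevity).

    next_q[i][j] is target head i's value at the j-th next state-action pair.
    allowed_heads (hypothetical) restricts which heads may be sampled, e.g.
    to a subset chosen by some audit; None means every head is eligible.
    """
    pool = list(range(len(next_q))) if allowed_heads is None else list(allowed_heads)
    # Sample a random subset of heads without replacement.
    chosen = rng.sample(pool, subset_size)
    targets = []
    for j in range(len(reward)):
        # Pessimistic estimate: minimum over the sampled heads only.
        min_q = min(next_q[i][j] for i in chosen)
        targets.append(reward[j] + gamma * (1.0 - done[j]) * min_q)
    return targets, chosen

# Example call: three heads, a batch of two transitions.
example_q = [[1.0, 2.0], [5.0, 0.0], [3.0, 3.0]]
example_targets, example_heads = redq_target(
    example_q, reward=[0.0, 1.0], done=[0.0, 1.0], rng=random.Random(0))
```

Because each target depends only on the sampled heads, a single head with unusual values influences the target whenever it is drawn, which is why the composition of the sampling pool matters at transfer time.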
Submission Number: 150