Keywords: off-policy, on-policy, offline RL, adaptive training, large model post-training
TL;DR: Batch-computed statistics replace hand-tuned knobs in both off-policy and on-policy fine-tuning, without adding any new knobs.
Abstract: Reinforcement learning (RL) is structurally harder than supervised learning because the policy changes the data distribution it learns from. The resulting fragility is especially visible in large-model training, where the training and rollout systems differ in numerical precision, sampling, and other implementation details. Existing methods manage this fragility by adding more hyper-parameters to the training objective. Each may help in its tuned regime but makes the resulting algorithm more sensitive to its configuration, requiring retuning whenever the task, model scale, or distribution mismatch changes. This fragility traces to two concerns that current objectives entangle through hyper-parameters set before training begins. The first is a trust-region concern, in that each update should not move the policy too far from its current value. The second is an off-policy concern, in that data collected by older or different behavior policies should influence the current update only to the extent that the update remains reliable. Mishandling off-policy data is consequential, yet such data can still carry useful signal that must be weighted adaptively as training proceeds. Neither concern is a constant to set before training, and their severity is reflected in the policy-ratio distribution of the current batch. We present a simple yet effective batch-adaptive objective that replaces fixed clipping with the normalized effective sample size of the policy ratios. The same statistic caps the score-function weight and sets the strength of an off-policy regularizer. When the ratios are nearly uniform, the update stays close to the usual on-policy score-function update. When stale or mismatched data cause ratio concentration, the update tightens automatically while retaining a nonzero learning signal on high-ratio tokens. 
Experiments across a wide range of settings show that our method matches or exceeds tuned baselines, introducing no new objective hyper-parameters and removing several existing ones.
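As a rough illustration of the statistic the abstract refers to, the sketch below computes the standard (Kish) normalized effective sample size of a batch of policy ratios, (sum w)^2 / (n * sum w^2). This is only an assumed form: the abstract does not say whether ratios are taken per token or per sequence, nor how the statistic caps the score-function weight or scales the regularizer, so none of that is shown. The function name `normalized_ess` and the example inputs are hypothetical.

```python
import math


def normalized_ess(log_ratios):
    """Kish normalized effective sample size of ratios w_i = exp(log_ratios[i]).

    Assumed textbook definition, not necessarily the paper's exact one.
    The value lies in [1/n, 1]: 1 when all ratios are equal, and it
    approaches 1/n as a single ratio dominates the batch.
    """
    if not log_ratios:
        raise ValueError("need at least one ratio")
    # Shift by the max log-ratio for numerical stability; the statistic
    # is invariant to a common rescaling of all ratios.
    shift = max(log_ratios)
    weights = [math.exp(x - shift) for x in log_ratios]
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return (total * total) / (len(weights) * total_sq)


# Hypothetical batches: identical ratios vs. one ratio far larger than the rest.
uniform = normalized_ess([0.0] * 8)
concentrated = normalized_ess([0.0] * 7 + [5.0])
```

With identical ratios the statistic equals 1, matching the "nearly uniform" regime described above; when one ratio dominates it falls toward 1/n, which is the signal an adaptive objective could use to tighten the update.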
Submission Number: 149