Keywords: offline reinforcement learning, model selection, stability
TL;DR: We propose a method that adaptively weights offline policy evaluation estimates based on their relative conditional variances.
Abstract: The goal of offline policy evaluation (OPE) is to evaluate a target policy from logged data collected under a different distribution. Because no single estimator is uniformly best, model selection is important, yet difficult without online exploration. We propose soft stability weighting (SSW), which adaptively combines offline estimates from ensembles of fitted Q-evaluation (FQE) and model-based evaluation, where ensemble members differ only in the random initialization of their neural networks. SSW computes a state-action-conditional weighted average of the median FQE prediction and the median model-based prediction: each method's state-action-conditional standard deviation is normalized by that method's average standard deviation, and the relatively more stable method receives the larger weight. SSW thus measures how stable each method's predictions are under perturbations of the random initialization, which is drawn from a truncated normal distribution scaled by the input feature size.
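The combination rule described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact weighting function (here, an inverse relative-instability ratio) and the epsilon constants are assumptions, and `fqe_preds` / `mb_preds` are hypothetical arrays of ensemble predictions.

```python
import numpy as np

def soft_stability_weighting(fqe_preds, mb_preds):
    """Combine FQE and model-based ensemble predictions per (s, a) pair.

    fqe_preds, mb_preds: arrays of shape (n_pairs, ensemble_size), holding
    each ensemble member's value prediction for every state-action pair.
    Ensemble members are assumed to differ only in random initialization.
    """
    # Per-pair medians of each ensemble.
    med_fqe = np.median(fqe_preds, axis=1)
    med_mb = np.median(mb_preds, axis=1)

    # State-action-conditional standard deviations of each ensemble.
    sd_fqe = fqe_preds.std(axis=1)
    sd_mb = mb_preds.std(axis=1)

    # Normalize each method's conditional std by that method's average std,
    # making the two methods' stabilities comparable across scales.
    rel_fqe = sd_fqe / (sd_fqe.mean() + 1e-8)
    rel_mb = sd_mb / (sd_mb.mean() + 1e-8)

    # Illustrative weighting: each method is weighted by the other's
    # relative instability, so the relatively more stable ensemble
    # dominates the combined estimate at that state-action pair.
    w_fqe = rel_mb / (rel_fqe + rel_mb + 1e-8)
    w_mb = 1.0 - w_fqe
    return w_fqe * med_fqe + w_mb * med_mb
```

For a state-action pair where the FQE ensemble agrees closely across initializations while the model-based ensemble scatters, the weight shifts toward the FQE median, and vice versa.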