Abstract: In this paper, we present Online Empirical Value Learning (ONEVaL), an `online' reinforcement learning algorithm for continuous MDPs. The algorithm is `quasi-model-free' (it needs a generative/simulation model but not the model itself), computes nearly-optimal policies, and comes with nonasymptotic performance guarantees, including prescriptions on the sample complexity required for specified performance bounds. The algorithm relies on the use of a `fully' randomized policy that generates a β-mixing sample trajectory. It also relies on randomized function approximation in an RKHS, which yields arbitrarily small function approximation error, and on an `empirical' estimate of the value of the next state computed from several samples of the next state drawn from the generative model. We demonstrate its good numerical performance on benchmark problems. We note that the algorithm requires no hyperparameter tuning and is robust to other issues that seem to plague Deep RL algorithms.
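
A minimal sketch of the kind of update the abstract describes, under assumptions of our own (the paper's exact ONEVaL algorithm may differ): random Fourier features stand in for the randomized RKHS approximation, and the value of each next state is estimated empirically from m samples drawn from a generative model. The names `RandomFourierFeatures`, `empirical_value_update`, `sample_next`, and `reward` are illustrative, not taken from the paper.

```python
# Hypothetical sketch: empirical value backup with randomized RKHS features.
import numpy as np

class RandomFourierFeatures:
    """Random features approximating a Gaussian-kernel RKHS (assumed kernel choice)."""
    def __init__(self, state_dim, n_features=200, bandwidth=1.0, rng=None):
        rng = rng or np.random.default_rng(0)
        self.W = rng.normal(scale=1.0 / bandwidth, size=(n_features, state_dim))
        self.b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
        self.scale = np.sqrt(2.0 / n_features)

    def __call__(self, states):
        # states: (n, state_dim) -> features: (n, n_features)
        return self.scale * np.cos(states @ self.W.T + self.b)

def empirical_value_update(theta, features, states, action_space,
                           sample_next, reward, gamma, m=10):
    """One fitted-value-style sweep over states visited by the exploration policy.

    sample_next(s, a, m) -> (m, state_dim) next-state samples from the
    generative model; reward(s, a) -> scalar.  The Bellman backup at each
    state replaces the expectation over next states by the empirical mean
    over the m sampled next states.
    """
    Phi = features(states)                       # (n, n_features)
    targets = np.empty(len(states))
    for i, s in enumerate(states):
        q_vals = []
        for a in action_space:
            ns = sample_next(s, a, m)            # (m, state_dim)
            v_next = features(ns) @ theta        # current value estimates at samples
            q_vals.append(reward(s, a) + gamma * v_next.mean())
        targets[i] = max(q_vals)                 # greedy empirical backup
    # Ridge-regularized least-squares fit of the updated value function.
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + 1e-3 * np.eye(d), Phi.T @ targets)
```

In this sketch one would initialize `theta = np.zeros(n_features)` and repeat the update over successive batches of states visited by the randomized exploration policy; the interplay between the β-mixing trajectory and the sample-complexity guarantees is the subject of the paper itself and is not reproduced here.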