Reproducibility study of the Value Equivalence Principle for Model Based Reinforcement Learning

Jan 31, 2021 (edited Apr 01, 2021) · ML Reproducibility Challenge 2020 Blind Submission
  • Keywords: Value Equivalence Principle, Model-Based RL
  • Abstract: Reproducibility Summary

    Scope of Reproducibility: \cite{grimm2020value} introduces and studies the concept of value equivalence for reinforcement-learning models with respect to a set of policies and value functions. It further shows that this principle can be leveraged to find models that, under constrained representational capacity, outperform their maximum-likelihood counterparts.

    Methodology: The code for the original paper is not publicly available, so we re-implemented the three sets of experiments (including the baseline) and the authors' custom environments. All experiments were run on Google Colab and required roughly 160 hours of GPU time in total.

    Results: Since all results in the original paper are presented graphically, we cannot report precise numerical comparisons. For experiments with $\mathrm{span}(\mathcal{V}) \approx \Ddot{\mathcal{V}}$, our results match those reported. For experiments with $\mathrm{span}(\mathcal{V}) \approx \tilde{\mathcal{V}}$ and linear function approximation, our results for both the baseline and the authors' method diverge from the reported graphs. For experiments with $\mathrm{span}(\mathcal{V}) \approx \tilde{\mathcal{V}}$ and neural networks, our results follow the reported trend, though not always with the same values.

    What was easy: Although we had to re-implement everything from scratch, the general pipeline for all experiments was straightforward and well described in the original paper, and the environments used in all three experiments were reasonably simple.

    What was difficult: \cite{grimm2020value} combines theory and experiments, and understanding the theorems presented in the paper requires a solid grounding in linear algebra. For the experiments with linear function approximation, features were selected using k-means, which depends heavily on the initialisation of the centroids. Repeatedly running k-means on a large dataset to find the best fit takes a significant amount of time (more than 10 hours for 10,000 initialisations on a dataset of 1,000,000 points). For these experiments, we also found that the stated learning rate did not yield a good model (see figure \ref{LR_search}).

    Communication with original authors: We contacted the author by email with multiple queries about the custom environments, hyper-parameters, feature selection and other minute experimental details. The author replied to all of them thoroughly and within a reasonable time.
  • Paper Url:
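The feature-selection step described in the abstract — repeating k-means over many random centroid initialisations and keeping the best fit — can be sketched as follows. The original feature-selection code is not public, so this is only a minimal illustration of the restart strategy (function names and parameters are our own, not the authors'); the best run is the one with the lowest within-cluster sum of squares (inertia).

```python
import numpy as np

def kmeans(X, k, n_iters=100, rng=None):
    """A single run of Lloyd's algorithm; returns (centroids, inertia)."""
    rng = np.random.default_rng(rng)
    # Random initialisation: pick k distinct data points as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    # Recompute assignments against the final centroids before scoring.
    labels = np.linalg.norm(
        X[:, None, :] - centroids[None, :, :], axis=-1).argmin(axis=1)
    inertia = ((X - centroids[labels]) ** 2).sum()
    return centroids, inertia

def best_of_n_kmeans(X, k, n_init=50, seed=0):
    """Repeat k-means with fresh initialisations; keep the lowest-inertia fit."""
    best = min((kmeans(X, k, rng=seed + i) for i in range(n_init)),
               key=lambda run: run[1])
    return best[0]
```

Because each restart re-runs Lloyd's algorithm from scratch, the cost grows linearly in `n_init`, which is why 10,000 initialisations on a million-point dataset took over 10 hours in our experiments.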