[Re] Reproducibility Study of Behavior Transformers

Published: 02 Aug 2023, Last Modified: 02 Aug 2023, MLRC 2022 OutstandingPaperHonorableMention, Readers: Everyone
Keywords: rescience c, machine learning, reinforcement learning, imitation learning, behavior cloning, transformer, pytorch, python
TL;DR: Reproducibility report of the paper "Behavior Transformers: Cloning $k$ modes with one stone"
Abstract: Scope of Reproducibility - In this work, we analyze the reproducibility of 'Behavior Transformers: Cloning $k$ modes with one stone'. In assessing the Behavior Transformer (BeT) model, we analyze its ability to generate performant and diverse rollouts when trained on data containing multi-modal behaviors, the relevance of each of its components, and its sensitivity to critical hyperparameters.
Methodology - We use the open-source PyTorch implementation released by the authors to train and sample rollouts for BeT. However, the implementation does not include all the environments, evaluation metrics, or ablations studied in the paper. Consequently, we extend it by following the details in the paper and filling in the missing parts to have a complete pipeline and support all the experiments performed in this report. We conducted our experiments on an NVIDIA GeForce GTX 780 GPU, requiring 276 GPU hours to train our models.
Results - Running the code released by the authors does not produce an evaluation of BeT according to the metrics reported in the paper. After extending the implementation with the proper evaluation metrics, we obtain results that support the main claims of the paper in a significant subset of the experiments but that also diverge in many of the actual values obtained. Therefore, we conclude that the paper is largely replicable but not readily reproducible.
What was easy - It was easy to identify the main claims of the paper and the experiments supporting them. Moreover, thanks to the open-source implementation released by the authors, training the model and sampling rollouts were straightforward tasks.
What was difficult - Setting up the development environment was hard due to dependencies not being pinned. Not having the code for evaluation metrics available hindered our efforts to achieve similar numbers. Assessing the sources of discrepancies in our numbers was also difficult, as training curves and model weights were not accessible.
Communication with original authors - We communicated via email with the authors throughout the project. They provided clarifications and resources that helped us with our study. However, the communication was insufficient to reach a complete reproduction.
Paper Url: https://proceedings.neurips.cc/paper_files/paper/2022/hash/90d17e882adbdda42349db6f50123817-Abstract-Conference.html
Paper Review Url: https://openreview.net/forum?id=agTr-vRQsa
Paper Venue: NeurIPS 2022
Confirmation: The report pdf is generated from the provided camera ready Google Colab script, The report metadata is verified from the camera ready Google Colab script, The report contains correct author information, The report contains link to code and SWH metadata, The report follows the ReScience latex style guides as in the Reproducibility Report Template (https://paperswithcode.com/rc2022/registration), The report contains the Reproducibility Summary in the first page, The latex .zip file is verified from the camera ready Google Colab script
Latex: zip
Journal: ReScience Volume 9 Issue 2 Article 43
Doi: https://www.doi.org/10.5281/zenodo.8173757
Code: https://archive.softwareheritage.org/swh:1:dir:4a562f75c0fd44672b806498e18b67690a5baabd