Keywords: Reproducibility, Compositional Generalization
Abstract: Reproducibility Summary
Scope of Reproducibility — We examine the reproducibility of compositional generalization results from the task of semantic parsing. We aim to reproduce results from [1], [2], and [3] and seek to verify four claims: (1) a model should not be expected to perform well on non-synthetic datasets merely because it performs well on SCAN [1]; (2) the approaches from [1] and [2] meet or exceed baseline performance on compositional generalization tests; (3) NQG-T5 [1] outperforms baselines on both synthetic and natural data; and (4) NQG [1] performs well on the instances for which it is able to generate a prediction, but it cannot generate predictions for all instances.
Methodology — We reuse the authors' code, add code to run extra experiments, and re-implement training scripts whose dependencies are no longer supported. Eight 32GB GPUs were used for the experiments; a detailed description is given in Section 3.3.
Results — Claim 1 is verified: the model with the highest performance on SCAN does not maintain its high performance on other datasets (Section 4.1). Claims 2 and 3 are verified by comparing the performance of NQG-T5 against the baseline models selected in [1] and [2]. Claim 4 is verified by computing the coverage and precision of NQG in Section 4.4. Overall, accuracy for most experiments is within 2% of that reported in the original paper; the only deviation is that our T5 achieves slightly higher performance on some splits and slightly lower performance on one split than previously reported.
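To make the coverage/precision distinction behind Claim 4 concrete, the following is a minimal sketch, assuming the parser returns None whenever it fails to produce a prediction; the function name and data format are illustrative and not taken from the NQG codebase.

```python
from typing import Callable, Optional, Sequence, Tuple

def coverage_and_precision(
    inputs: Sequence[str],
    targets: Sequence[str],
    predict: Callable[[str], Optional[str]],
) -> Tuple[float, float]:
    """Coverage: fraction of instances for which the model emits any
    prediction. Precision: exact-match accuracy restricted to those
    covered instances. (Illustrative definitions based on Section 4.4.)"""
    covered, correct = 0, 0
    for source, target in zip(inputs, targets):
        prediction = predict(source)  # assumed: None when NQG cannot parse
        if prediction is None:
            continue
        covered += 1
        if prediction == target:
            correct += 1
    coverage = covered / len(inputs) if inputs else 0.0
    precision = correct / covered if covered else 0.0
    return coverage, precision
```

Under these definitions a model can have high precision but low coverage, which is exactly the trade-off Claim 4 describes for NQG.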
What was easy — All papers provide clearly written code and informative documentation, as well as the hyperparameter settings used for the experiments. The papers also describe their approaches clearly, making the experimental workflow easy to follow.
What was difficult — The exact-match evaluation metric is formulated somewhat differently across the three papers, leading to non-negligible value differences, as discussed in Section 5.2. We also had to re-implement some training scripts because an original dependency is no longer supported. Moreover, some experiments are computationally expensive: [1] used TPUs for experiments, while our replication on GPUs takes several days to train a single T5 model.
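As an illustration of how exact-match formulations can diverge, the sketch below contrasts strict string equality with a hypothetical normalized variant; the actual preprocessing rules in the three papers differ from this and from each other, which is what produces the value differences discussed in Section 5.2.

```python
import re

def exact_match_strict(prediction: str, target: str) -> bool:
    """Raw string equality: sensitive to casing and whitespace."""
    return prediction == target

def exact_match_normalized(prediction: str, target: str) -> bool:
    """Equality after lowercasing and collapsing whitespace.
    This normalization is an illustrative assumption, not the exact
    rule used by any of the three papers."""
    def normalize(s: str) -> str:
        return re.sub(r"\s+", " ", s.strip().lower())
    return normalize(prediction) == normalize(target)

# The same prediction can be scored differently under the two variants:
pred, gold = "SELECT name FROM  city", "select name from city"
print(exact_match_strict(pred, gold))      # False
print(exact_match_normalized(pred, gold))  # True
```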
Communication with original authors — The authors of all three papers provided us with useful instructions for working with their methods and constructive feedback on the draft.
Paper Url: https://aclanthology.org/2021.acl-long.75/
Paper Venue: Other venue (not in list)
Venue Name: ACL2021
Supplementary Material: zip
Confirmation: The report PDF is generated from the provided camera-ready Google Colab script; the report metadata is verified from the camera-ready Google Colab script; the report contains correct author information; the report contains a link to the code and SWH metadata; the report follows the ReScience LaTeX style guides as in the Reproducibility Report Template (https://paperswithcode.com/rc2022/registration); the report contains the Reproducibility Summary on the first page; the LaTeX .zip file is verified from the camera-ready Google Colab script.
Latex: zip
Journal: ReScience Volume 9 Issue 2 Article 44
Doi: https://www.doi.org/10.5281/zenodo.8173759
Code: https://archive.softwareheritage.org/swh:1:dir:f508ee8f31bd7a768d1fa09e7fedf834e663fd6e