In this work, we reproduced the results of \cite{shaw-etal-2021-compositional}, \cite{kim2021sequencetosequence}, and \cite{kim-linzen-2020-cogs}.
We verified the claims that strong performance on SCAN does not entail strong performance on non-synthetic datasets, and that Neural-QCFG achieves performance comparable to that of the baseline approaches.
We also verified that NQG-T5 outperforms the baselines on both synthetic and natural data.
Beyond the reported results, we found that T5 converges early under the training strategy used for SPIDER and GEOQUERY.
Finally, we highlight that the EM implementations differ across the papers, which affects the results; we therefore aligned our EM computation with each paper's implementation to ensure a faithful reproduction.