In this work, we reproduce and connect results from three papers \citep{shaw-etal-2021-compositional, kim2021sequencetosequence, kim-linzen-2020-cogs} that focus on compositional generalization in semantic parsing.
More specifically, we train and evaluate four models (LSTM \cite{hochreiter1997long}, T5 \cite{raffel2020exploring}, Neural-QCFG \cite{kim2021sequencetosequence}, and NQG \cite{shaw-etal-2021-compositional}) on different splits of four datasets used to evaluate compositional generalization, some synthetic and some realistic: SCAN \cite{Lake2018GeneralizationWS}, GEOQUERY, SPIDER \cite{yu2018spider}, and COGS \cite{kim-linzen-2020-cogs}.
Broadly speaking, we are interested in whether model performance on synthetic datasets aligns with performance on more realistic datasets.\footnote{Following \cite{shaw-etal-2021-compositional}, we deem a dataset \textit{synthetic} if its instances are heuristically generated by a program, and \textit{natural/realistic} if its instances are produced directly by humans.}
To this end, we aim to replicate \cite{shaw-etal-2021-compositional}, who find that performance of several pre-trained models on SCAN, a synthetic dataset, does not align with performance on GEOQUERY and SPIDER, whose instances are crafted by humans.
We then extend their findings in two ways.
On the data side, we add experiments with an additional synthetic compositional generalization dataset, COGS \cite{kim-linzen-2020-cogs}, to broaden the range of synthetic data considered.
On the modeling side, we add two model architectures: a vanilla LSTM (as used by \cite{kim-linzen-2020-cogs}) and Neural-QCFG, the sequence-to-sequence (seq2seq) model proposed by \cite{kim2021sequencetosequence}, which incorporates parametrized grammars to capture hierarchical structure.
In doing so, we also replicate (part of) the results from \cite{kim2021sequencetosequence} and \cite{kim-linzen-2020-cogs}.

% Dieuwke: I rewrote / shortened a bit the introduction, the original second part of the intro is here below, commented out!
%We extend their finding by connecting the results across the papers, adding experiments on a vanilla LSTM, which is used in \cite{kim-linzen-2020-cogs}.
% Working towards an approach that handles both realistic and synthetic data, \cite{shaw-etal-2021-compositional} also introduced a new splitting method to evaluate compositional generalization in realistic datasets as well as a new model and its ensemble with T5 to handle natural language variation. 

%In addition to \cite{shaw-etal-2021-compositional}, we reproduce part of the results from \cite{kim2021sequencetosequence}, which proposes a sequence-to-sequence (seq2seq) model that incorporates parameterized grammar to capture hierarchical structure. 
% Including this model means we have more variation in modeling choices than \cite{shaw-etal-2021-compositional}, which focused only on T5 and NQG.
% Both \cite{shaw-etal-2021-compositional} and \cite{kim2021sequencetosequence} only used SCAN as the synthetic dataset.
%To examine whether the same observations are shared across different synthetic datasets, we added more variation in the dataset choice as well, by also replicating results from \cite{kim-linzen-2020-cogs}, which introduces a synthetic compositional generalization dataset, COGS, and we additionally fine-tuned T5 on COGS.

% In this work, we compare the reproducibility of compositional generalization results on the task of semantic parsing. We aim to reproduce the corresponding results from \cite{shaw-etal-2021-compositional}, \cite{kim2021sequencetosequence}, and \cite{kim-linzen-2020-cogs}. We train LSTM, T5, Neural-QCFG, and NQG on several compositional generalization datasets (SCAN, GEOQUERY, SPIDER, and COGS), which are used in \cite{shaw-etal-2021-compositional}, \cite{kim2021sequencetosequence}, and \cite{kim-linzen-2020-cogs}, assessing the consistency of performance reproducibility across models.

% Shaw et al contribution: 
% - Demonstration that for n existing approaches, generalisation on SCAN is not well-correlated with performance on non-synthetic tasks
% - Introduction of NQG-T5, a new architecture that can deal with both super compositional and natural data, which they test on SCAN, GeoQuery, and SPIDER
% - Proposes TMCD, a new way of splitting natural datasets into splits that evaluates compositional generalization

% Kim contribution
% - Proposes to use quasi-synchronous grammars (QCFG) for seq2seq tasks. Tests their model on SCAN and some non-compositional datasets

% Kim & Linzen contribution
% - Propose a dataset COGS and test Transformers and LSTMs on it

% Question: how to best put this together?