% Explain your approach - did you use the author's code, or did you aim to re-implement the approach from the description in the paper? Summarize the resources (code, documentation, GPUs) that you used.
We reuse the original authors' code wherever possible. We reuse all the code from \cite{kim2021sequencetosequence} and \cite{kim-linzen-2020-cogs}. For \cite{shaw-etal-2021-compositional}, we reuse the code for NQG and port the code for fine-tuning T5 to PyTorch with Huggingface Transformers,\footnote{\url{https://github.com/huggingface/transformers}} because the original T5 dependency\footnote{\url{https://github.com/google-research/text-to-text-transfer-transformer}} is no longer maintained and PyTorch aligns better with the dependencies of the other two studies.
In each repository, we also fix minor issues caused by version incompatibilities or by sequence truncation during tokenization.
Finally, we refactor the code for \cite{shaw-etal-2021-compositional}, \cite{kim2021sequencetosequence}, and \cite{kim-linzen-2020-cogs} into a one-stop repository with a cleaned-up set of dependencies and unified experimental scripts.
For experiments on T5, we use eight Tesla V100 GPUs with 32GB of memory each, and a single V100 or a 16GB Quadro GP100 for the remaining models. Section \ref{sec:comp-requirements} includes a detailed list of the computational resources we use.
We list the datasets and the models below, with a summary of the experiments in Table \ref{tab:exp-list}.
\input{tables/contribution.tex}

\subsection{Datasets}

\input{tables/dataset_stat.tex}
% For each dataset include 1) relevant statistics such as the number of examples and label distributions, 2) details of train / dev / test splits, 3) an explanation of any preprocessing done, and 4) a link to download the data (if available).

The datasets we consider can be classified as either synthetic or realistic. 
Previous work on compositional generalization \citep{li-etal-2019-compositional, lake2019compositional, russin1904compositional, gordon2020permutation, liu2020compositional, chen2020compositional, nye2020learning} focused on modeling approaches that excel on synthetic datasets such as SCAN \cite{lake2018generalization}, while \cite{shaw-etal-2021-compositional} is motivated by the question of whether semantic parsing approaches can handle both synthetic and realistic data.

Each dataset we consider is divided into a training set and a test set according to one of several splitting strategies.
For \textit{random} or \textit{standard} splits, instances are assigned randomly to either the training or the test set.
For \textit{template} splits, instances that match a specific pattern are held out from the training set and appear only in the test set.
In \textit{length} splits, the instances with longer outputs (query length for SPIDER and GEOQUERY, command sequence length for SCAN) are allocated to the test set, and the remaining shorter instances make up the training set.
Maximum Compound Divergence (\textit{MCD}) is a splitting strategy introduced by \cite{keysers2020measuring} that maximizes compound divergence between the training and test sets while keeping atom divergence low.
\textit{MCD} requires that both source and target be generated by a rule-based method; \cite{shaw-etal-2021-compositional} therefore propose Target Maximum Compound Divergence (\textit{TMCD}) splits, which are comparable to \textit{MCD} but are also applicable to realistic datasets. In \cite{shaw-etal-2021-compositional}, the \textit{MCD} approach is applied to SCAN, while \textit{TMCD} is applied to GEOQUERY and SPIDER. Below are descriptive details for each dataset and its splits. Appendix \ref{appendix:examples} includes example instances from the datasets.
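For reference, the two divergences underlying \textit{MCD} can be sketched as follows. This is our summary of the definitions in \cite{keysers2020measuring}, to the best of our understanding; the symbols $\mathcal{F}_A$, $\mathcal{F}_C$, $V$, and $W$ are notation introduced here for illustration. Let $\mathcal{F}_A(\cdot)$ and $\mathcal{F}_C(\cdot)$ denote the frequency distributions of atoms and of (weighted) compounds in a set of examples, let $V$ and $W$ be the training and test sets, and let $C_\alpha$ be the Chernoff coefficient between two distributions $P = (p_k)$ and $Q = (q_k)$:
\[
C_\alpha(P \,\|\, Q) = \sum_k p_k^{\alpha}\, q_k^{1-\alpha}, \qquad \alpha \in [0, 1].
\]
Atom and compound divergence are then
\[
D_A(V \,\|\, W) = 1 - C_{0.5}\bigl(\mathcal{F}_A(V) \,\|\, \mathcal{F}_A(W)\bigr), \qquad
D_C(V \,\|\, W) = 1 - C_{0.1}\bigl(\mathcal{F}_C(V) \,\|\, \mathcal{F}_C(W)\bigr).
\]
An \textit{MCD} split seeks a partition with high $D_C$ and low $D_A$, so that the test set recombines familiar atoms into compounds that are rare or absent in training.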

\vspace{-3mm}
\paragraph{COGS.} COGS is a \textbf{synthetic} semantic parsing dataset created for assessing compositional generalization \cite{kim-linzen-2020-cogs}.
The inputs are English sentences, generated by a Probabilistic Context-Free Grammar (PCFG). 
The corresponding output, which is the semantic interpretation of the input, is annotated with the logical formalism of \citet{reddy-etal-2017-universal} and enhanced with a couple of postprocessing procedures.
COGS provides four different sets: \textit{train, development, test}, and \textit{generalization} sets. The instances in the generalization set are created from separate PCFGs, while the other three contain instances constructed with the same PCFGs. Unlike the other datasets we used, COGS does not introduce additional splits beyond the \textit{generalization} split.

\vspace{-3mm}
\paragraph{SCAN.} SCAN is a \textbf{synthetic} dataset in which English commands are to be converted into sequences of prespecified actions. 
The actions are composed of simple movement designations such as ``JUMP'' or ``TURN RIGHT''. 
In addition to the \textit{random}, \textit{length}, \textit{template}, and \textit{MCD} splits introduced above, two further splits of SCAN from \cite{lake2018generalization} are used in \cite{shaw-etal-2021-compositional} and \cite{kim2021sequencetosequence}: 
%The splits used by the original paper are shown below:
\begin{itemize}[noitemsep,topsep=1pt,parsep=0.8pt,partopsep=0pt]
\item \textit{Add primitive (JUMP)} - The training set excludes compositional commands that use the primitive ``JUMP''; these commands make up the test set.
\item \textit{Add primitive (TURN LEFT)} - Analogous to the previous split, but for the primitive ``TURN LEFT'': compositional commands that use it are held out from the training set.
\end{itemize}

\vspace{-3mm}
\paragraph{GEOQUERY.} GEOQUERY \citep{tang2001using, zelle1996learning} contains natural language questions about US geography. 
A model is fed the question and is expected to output the corresponding query, which can be executed to search a database.
\cite{shaw-etal-2021-compositional} replace all entity mentions with placeholders and use a variant, Functional Query Language (FunQL), as the target representation. They construct four splits, \textit{standard}, \textit{template}, \textit{length}, and \textit{TMCD}, which are used for training the models.

\vspace{-3mm}
\paragraph{SPIDER.} SPIDER is a text-to-SQL dataset that spans multiple domains. SPIDER was originally designed for cross-domain semantic parsing and requires generalization to new database schemas by using different databases in the training and test sets. Its SQL queries also have a more complex syntax. 
\cite{shaw-etal-2021-compositional} adopt a setting where databases are shared between train and test examples, so that the dataset splits can be dedicated to evaluating compositional generalization. Similar to GEOQUERY, the following splits are generated in \cite{shaw-etal-2021-compositional}: \textit{standard}, \textit{template}, \textit{length}, and \textit{TMCD}.

\subsection{Models}
%Include a description of each model or algorithm used. Be sure to list the type of model, the number of parameters, and other relevant info (e.g. if it's pretrained). 
\paragraph{NQG.}
To work towards a semantic parsing approach that can handle both compositional generalization and natural language variation, \citet{shaw-etal-2021-compositional} proposed an ensemble, NQG-T5, that chains a grammar-based model with a pre-trained seq2seq model.
The grammar-based component, made up of a \textbf{N}eural parsing model with \textbf{Q}uasi-synchronous \textbf{G}rammar induction, first induces a QCFG, then trains a discriminative latent variable parsing model to make derivations with the induced grammar. 
On instances for which NQG cannot provide an output, \citet{shaw-etal-2021-compositional} fall back on T5 \cite{raffel2020exploring} to make predictions.

\vspace{-3mm}
% \kaiser{Tentative TODO: Insert the prediction generation figure here}
\paragraph{Neural-QCFG.}
\citet{kim2021sequencetosequence} also use a quasi-synchronous grammar in their proposed approach, the Neural Quasi-Synchronous Context-Free Grammar (Neural-QCFG).
In contrast to \cite{shaw-etal-2021-compositional}, Neural-QCFG parameterizes the grammar's rule probabilities, treats the source and target trees as latent variables during training, and has no fall-back module.
Therefore, Neural-QCFG performs end-to-end generation and is easily applicable to a wide range of seq2seq tasks.

\vspace{-3mm}
\paragraph{T5.}
T5 \cite{raffel2020exploring} is an encoder-decoder Transformer \cite{vaswani2017attention} model that is pre-trained on multiple tasks. 
Each task is converted into a seq2seq format with a task-specific prefix, which makes T5 applicable to a wide variety of tasks.
\cite{shaw-etal-2021-compositional} use T5-base and T5-3B both as a baseline and as a fallback model when NQG fails to produce a target.
Due to computational constraints, we only reproduce the results with T5-base.

\vspace{-3mm}
\paragraph{LSTM.}
Long Short-Term Memory (LSTM) is a classical recurrent neural network architecture that is widely used across many tasks. 
Because it is not pre-trained, LSTM is a suitable candidate for gauging how classical models perform on the compositional generalization datasets. \citet{kim-linzen-2020-cogs} train uni- and bidirectional LSTMs on COGS. We reproduce their LSTM results on COGS and also train LSTMs on the other datasets used in \cite{shaw-etal-2021-compositional} and \cite{kim2021sequencetosequence} for comparison.

\subsection{Experimental Setup}
% Include a description of how the experiments were set up that's clear enough a reader could replicate the setup. 
% Include a description of the specific measure used to evaluate the experiments (e.g. accuracy, precision@K, BLEU score, etc.). 
% Provide a link to your code.
We train or fine-tune the models on each dataset as specified above, and evaluate on the corresponding test set.\footnote{Our replication code can be found at \url{https://anonymous.4open.science/r/CompGen/}; all code will be made public upon acceptance.}
For COGS, we also evaluate models on the generalization set.
We use exact-match accuracy (EM) as the evaluation metric. 
Note that because the vocabulary of T5 does not contain the ``$<$'' symbol, which appears in a large number of instances, all UNK tokens in the output of T5 are treated as ``$<$'' during evaluation. For LSTM, we use five different random seeds and report the averaged performance.
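As a sketch of the metric, with notation introduced here for illustration, let $N$ be the number of evaluation instances, $y_i$ the gold target sequence, $\hat{y}_i$ the predicted sequence, and $\mathbf{1}[\cdot]$ the indicator function. We assume the comparison is made at the level of whole output sequences:
\[
\mathrm{EM} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\bigl[\, \phi(\hat{y}_i) = y_i \,\bigr],
\]
where $\phi$ is a hypothetical normalization map: for T5 it replaces every UNK token in the prediction with ``$<$'', and for all other models it is the identity.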

\vspace{-3mm}
\paragraph{Hyperparameters.}
% Describe how the hyperparameter values were set. If there was a hyperparameter search done, be sure to include the range of hyperparameters searched over, the method used to search (e.g. manual search, random search, Bayesian optimization, etc.), and the best hyperparameters found. Include the number of total experiments (e.g. hyperparameter trials). You can also include all results from that search (not just the best-found results).
% We employ the hyperparameters from the original papers. The experiments of T5 was conducted with a batch size of 256 on Cloud TPUs in the original paper. This cannot be satisfied by the computing resources that are available to us, thus we used a gradient accumulation step of 32 to achieve the same equivalent batch size. The full hyperparameter list for each model is in Table .

Following the original authors, we use a learning rate of $1.0 \times 10^{-4}$ and an equivalent batch size of 256 for experiments with T5.
We fine-tune for 2,400 steps on GEOQUERY and 10,000 steps on SPIDER. 
Because we do not have access to the TPUs used in \cite{shaw-etal-2021-compositional}, we use 16 gradient accumulation steps to achieve the same equivalent batch size. We optimize for only 2,400 steps instead of 3,000 steps on GEOQUERY because no clear improvement in performance is observed after 1,000 steps. 
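The equivalent batch size under gradient accumulation follows the usual relation below, where $b$ is the per-device batch size, $G$ is the number of GPUs used in data parallel, and $k$ is the number of accumulation steps (symbols introduced here for illustration):
\[
B_{\mathrm{eq}} = b \times G \times k .
\]
As an illustration only, if all eight GPUs are used in data parallel (an assumption; the per-device batch size is not stated here), then $B_{\mathrm{eq}} = 256$ and $k = 16$ would imply $b = 256 / (8 \times 16) = 2$.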
\cite{shaw-etal-2021-compositional} reported T5 results on SCAN from \cite{keysers2020measuring}, who used a different set of hyperparameters. 
In our experiments for T5 on SCAN, we conduct a minimal hyperparameter search over the setups from \cite{keysers2020measuring} and \cite{shaw-etal-2021-compositional}, together with a commonly used learning rate of $1.0 \times 10^{-3}$, and obtain better performance with the hyperparameters from \cite{shaw-etal-2021-compositional}. 
Therefore, all the results on T5 in the next section follow the hyperparameters of \cite{shaw-etal-2021-compositional}, with 4,550 optimization steps, the point at which we begin to observe convergence.

For experiments with NQG, we use the original setup, except that we use the BERT-Base model for SCAN and SPIDER instead of the BERT-Tiny model of \cite{shaw-etal-2021-compositional}, because
the original BERT-Tiny model is no longer available in the TensorFlow model release\footnote{\url{https://github.com/tensorflow/models}} used by the original paper. 
For each trial, NQG is fine-tuned for 256 steps with a learning rate of $1.0 \times 10^{-4}$ and equivalent batch size of 256.
% 
For Neural-QCFG, we employ the same set of hyperparameters as \cite{kim2021sequencetosequence}, which includes the Adam optimizer \cite{kingma-ba-2015-adam} with a learning rate of $5.0 \times 10^{-4}$, gradient norm clipping at 3, and an $L_2$ penalty of $10^{-5}$. 
With a batch size of 4, the model is trained for 15 epochs with early stopping based on the best-performing checkpoint on the validation set.
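For clarity, we assume the standard global-norm formulation of gradient clipping, in which the gradient $g$ is rescaled whenever its $L_2$ norm exceeds a threshold $c$ (here $c = 3$; the symbols are introduced for illustration):
\[
g \leftarrow g \cdot \min\left(1, \frac{c}{\|g\|_2}\right).
\]
Gradients with norm at most $c$ are left unchanged, and larger gradients keep their direction but are scaled down to norm $c$.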
% 
For LSTMs, we adopt the hyperparameters from \cite{kim-linzen-2020-cogs}, who use a Noam learning rate scheduler \cite{vaswani2017attention} with an initial learning rate of 2 and optimize for 30,000 steps. The equivalent batch size for LSTMs is 128.
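For reference, the Noam schedule of \cite{vaswani2017attention}, scaled by a base rate $\lambda$, sets the learning rate at step $t \geq 1$ to
\[
\eta(t) = \lambda \cdot d_{\mathrm{model}}^{-0.5} \cdot \min\left(t^{-0.5},\; t \cdot w^{-1.5}\right),
\]
where $d_{\mathrm{model}}$ is the model dimension and $w$ is the number of warmup steps. The rate grows linearly for the first $w$ steps and then decays proportionally to $t^{-0.5}$. We assume here that the initial learning rate of 2 plays the role of $\lambda$; the values of $d_{\mathrm{model}}$ and $w$ are not stated in this section and should be taken from the original configuration.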

\vspace{-3mm}
\paragraph{Computational Requirements.}
\label{sec:method:compreq}
\label{sec:comp-requirements}
% Include a description of the hardware used, such as the GPU or CPU the experiments were run on. 
Eight Tesla V100 GPUs, each with 32GB memory, are used for the experiments with T5. 
For experiments with smaller models such as Neural-QCFG and NQG, one V100 is used. A single 16GB Quadro GP100 is used for training the LSTM.
%
% For each model, include a measure of the average runtime (e.g. average time to predict labels for a given validation set with a particular batch size).
%
% For each experiment, include the total computational requirements (e.g. the total GPU hours spent).
% (Note: you'll likely have to record this as you run your experiments, so it's better to think about it ahead of time). Generally, consider the perspective of a reader who wants to use the approach described in the paper --- list what they would find useful.
We report the average GPU hours spent for training models on each dataset in Table \ref{tab:TrainingTime} in the Appendix.