Abstract: We introduce $\texttt{DecompSR}$ (decomposed spatial reasoning), a large benchmark dataset (over 5 million datapoints) and generation framework designed to analyse compositional spatial reasoning ability. The $\texttt{DecompSR}$ generation framework allows users to independently vary several aspects of compositionality, namely productivity (reasoning depth), substitutivity (entity and linguistic variability), overgeneralisation (input order, distractors) and systematicity (novel linguistic elements). $\texttt{DecompSR}$ is built procedurally so that it is correct by construction, and its correctness is independently verified using a symbolic solver. We benchmark $\texttt{DecompSR}$ comprehensively across a host of Large Language Models (LLMs) and show that LLMs struggle with productive and systematic generalisation in spatial reasoning tasks, whereas they are more robust to linguistic variation. $\texttt{DecompSR}$ thus provides a provably correct and rigorous benchmarking dataset with a novel ability to independently vary the degrees of several key aspects of compositionality, allowing for robust and fine-grained probing of the compositional reasoning abilities of LLMs.
Submission Type: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=P81p2nTuvA&noteId=7OyjyMg9L7
Changes Since Last Submission: All changes in the submission are marked in blue text so that they can be easily identified.
In summary, we have:
- In Section 1 (Introduction), included further motivation for DecompSR and emphasised the priority of our contributions.
- In Section 2 (Background), included a richer introduction to spatial reasoning, in particular spatial reasoning in text.
- In Section 3 (The DecompSR dataset), we have added another scenario to Figure 1, as per reviewer xPmb's request.
- In Section 4, we have moved more results from the appendix into the main paper (as per reviewer zsuh's request). In more detail, we have:
  - Performed experiments with more models, including at least one reasoning model per experiment, as per reviewer xPmb's request.
  - Merged what was previously Section 4.6 (Natural language translation experiment) into what is now Section 4.4 (substitutivity experiment).
  - Added explanations of what constitutes a correct translation in the ASP experiment in what is now Section 4.5 (Evaluating reasoning tasks symbolically), as per reviewer 24Wv's request.
- In Section 5 (Discussion) we have made what is now Figure 8 (previously Figure 9) more readable (as per reviewer FWw7's request). We have also included examples of all possible semantic errors in the ASP execution in relation to what is now Figure 9 (previously Figure 10), as per reviewer xPmb's request, with full results presented in Appendix K, Figure 12.
- We have included further error analysis of the ASP experiment in a new figure, Figure 10 (full results in Figure 12 in Appendix K), which studies the breakdown of translation errors from DecompSR stories to ASP facts.
- We have included an impact statement (as per reviewer FWw7's request) in a new section, Section 7.
- We have provided cost details in a new appendix, Appendix C.1 (as per reviewer xPmb's request).
- In Appendix D (Productivity) we have added further analysis of the sudden drops in performance between the $k=1$ and $k=2$ experiments in Table 8 (as per reviewer zsuh's request).
- In Appendix E (Systematicity Experiment) we have included full results for the new models used in the systematicity experiment in Section 4.2 (Systematicity Experiment).
- In Appendix F (Substitutivity: Natural Language Translation Experiment) we present the full results for the multilingual experiments with the new reasoning model results included (Table 10).
- In Appendix H (Overgeneralisation Experiment) we include full results for all models used in the overgeneralisation experiment (Section 4.3).
- In Appendix I (Token Analysis for Natural Language Translation) we present the statistics of the number of tokens used in the multilingual experiment in Table 14 (as per reviewer zsuh's request on adding more information about the multilingual experiment).
- In Appendix J (Qualitative Analysis of Correct and Incorrect Predictions) we present sample outputs from our experiments (as per reviewer 24Wv's request).
- In Appendix K (Failure Modes) we have corrected the error in Figure 11 (previously Figure 10), as per reviewer 24Wv's recommendation.
Assigned Action Editor: ~Amit_Sharma3
Submission Number: 6833