Probing Reasoning of Language Models with Inductive In-Context Learning

21 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Primary Area: learning theory
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Generalization, HPO, Model Comparison, Bayesian
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Despite their recent success, Language Models (LMs) have raised questions about whether they merely repeat statistical patterns in their training data (‘stochastic parrots’) or can learn the underlying generative process of the data. Current benchmarks used to probe the reasoning ability of LMs can be ambiguous, as it is unclear whether the model has memorized the instances in the benchmark or has learned the generative process of the data. In this work, we introduce a novel evaluation setting, used together with Inductive In-Context Learning (IIL), and a dataset, ReAnalogy, to probe the reasoning ability of LMs. ReAnalogy consists of sequences of positive examples generated from regular expressions (regex) and contains quasi-natural language. We use regex to evaluate implicitly whether an LM can infer ‘Rules’ (regex) given limited sets of examples (‘Facts’). After training the LM on Facts, we use it to generate additional Facts and evaluate whether the generated Facts abide by the Rules. We evaluate a GPT model in our setting and compare it with the same model where a Rule is injected during training to replace a Fact. During evaluation, we use IIL to probe the model to infer the Rule given the Facts, and then use the inferred Rule to synthesize an additional Fact. IIL improves ‘reasoning’ performance by as much as 33%. Our results suggest that LMs can learn more than statistical patterns in the data, and we support our findings with ablation studies. We compare our dataset with existing benchmarks and baselines in inductive programming and find that current state-of-the-art symbolic and neurosymbolic approaches fail to handle the complexity of our dataset, while existing datasets and benchmarks in the Inductive Logic Programming domain are inapplicable to LMs. Our probing method and dataset are therefore complex enough to challenge LMs and suitable for evaluating their inductive reasoning abilities, while IIL improves the ‘reasoning’ of LMs and shows that they can learn more than statistics in the data. We open-source our dataset and the code used for our experiments.
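To make the evaluation concrete, here is a minimal sketch of the kind of Rule-compliance check the abstract describes: given a Rule (a regex) and Facts generated by the trained LM, measure the fraction of generated Facts that satisfy the Rule. The function name, the example regex, and the sample outputs are illustrative assumptions, not the authors' released code or dataset.

```python
import re

def rule_compliance(rule: str, generated_facts: list[str]) -> float:
    """Fraction of model-generated Facts that fully match the Rule (a regex).

    Mirrors the evaluation described in the abstract: after training on Facts
    sampled from a regex, the LM generates additional Facts, and we check
    whether they abide by the underlying Rule.
    """
    if not generated_facts:
        return 0.0
    pattern = re.compile(rule)
    hits = sum(1 for fact in generated_facts if pattern.fullmatch(fact) is not None)
    return hits / len(generated_facts)

# Hypothetical Rule and model outputs, for illustration only.
rule = r"(ab)+c{2,3}"
generated = ["ababcc", "abccc", "abab", "ababababccc"]
print(rule_compliance(rule, generated))  # 0.75
```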
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4087