Probing Reasoning of Language Models with Inductive In-Context Learning

Published: 16 Jun 2023, Last Modified: 19 Jun 2023 · IJCAI 2023 Workshop KBCG (Oral)
Keywords: inductive reasoning, inductive logic programming, in-context-learning
TL;DR: We use Inductive In-Context Learning and a novel evaluation setting to probe the reasoning ability of LMs and find that they learn more than statistics from the data.
Abstract: Despite their recent success, Language Models (LMs) have raised the question of whether they statistically repeat their training data ('stochastic parrots') [Bender et al., 2021] or can learn the underlying generative process of the data. Current benchmarks used to probe the reasoning ability of LMs can be inapplicable in this setting, as they are curated by human annotators and their generative process can be ambiguous. Additionally, current work probing the reasoning ability of LMs introduces the same bias as the question it attempts to answer: whether the model has learned the statistics of the benchmark data. In this work we introduce a novel evaluation setting, which we use together with Inductive In-Context Learning (IIL), and a dataset, ReAnalogy, to probe the reasoning ability of LMs. ReAnalogy consists of sequences of positive examples generated from regular expressions (regex) that contain quasi-natural language. We use regex to evaluate implicitly whether an LM can infer 'Rules' (regexes) from a limited set of examples ('Facts'). We use the LM to generate additional Facts and evaluate whether the generated Facts abide by the Rules. We evaluate a GPT model in our setting and compare it with the same model where a Rule is injected during training to replace a Fact. We use IIL during evaluation to probe the model to infer the Rule given the Facts, and then use the inferred Rule to synthesize an additional Fact. IIL improves 'reasoning' performance by as much as 33%. Our results suggest that LMs can learn more than statistical patterns in the data, and we support our findings with ablation studies. We evaluate our dataset against existing benchmarks and baselines in inductive programming and find that current state-of-the-art symbolic and neuro-symbolic approaches fail on the complexity of our dataset, while the existing datasets and benchmarks in the domain are inapplicable for LMs. Our probing method and dataset are complex enough for LMs and suitable for evaluating their inductive reasoning abilities, while IIL can improve the 'reasoning' of LMs. We open-source our dataset and the code used for our experiments.
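To illustrate the evaluation idea described in the abstract, the sketch below checks whether LM-generated Facts abide by a hidden Rule (regex). The regex, the example Facts, and the helper name `abides_by_rule` are illustrative assumptions for this sketch, not the paper's exact data or protocol.

```python
import re

# Hypothetical ReAnalogy-style task: a hidden Rule (regex) and the Facts
# (positive examples) generated from it. The regex below is illustrative only.
rule = r"the (cat|dog) sat on the (mat|rug)"
facts = ["the cat sat on the mat", "the dog sat on the rug"]

# Facts the LM synthesizes after being probed (e.g., via IIL) to infer the Rule.
generated_facts = ["the dog sat on the mat", "the cat sat on the sofa"]

def abides_by_rule(fact: str, rule: str) -> bool:
    """A generated Fact counts as correct if it fully matches the Rule."""
    return re.fullmatch(rule, fact) is not None

for fact in generated_facts:
    print(fact, "->", abides_by_rule(fact, rule))
# the dog sat on the mat -> True
# the cat sat on the sofa -> False
```

In this setup the Rule is never shown to the evaluator as ground truth for generation; it is only used post hoc to verify that the synthesized Facts are consistent with the generative process behind the examples.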