Large Language Models Are Not Strong Abstract Reasoners

20 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Supplementary Material: zip
Primary Area: causal reasoning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Abstract Reasoning, Large Language Models, Natural Language Processing
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We extensively evaluate Large Language Models on abstract reasoning tasks and show that they achieve poor performance, in contrast with their strong performance on other natural language tasks.
Abstract: Large Language Models have shown tremendous performance on a wide variety of natural language processing tasks, ranging from text comprehension to common sense reasoning. However, the mechanisms responsible for this success remain opaque, and it is unclear whether LLMs can achieve human-like cognitive capabilities or whether these models are still fundamentally limited. Abstract reasoning is a fundamental task for cognition, consisting of finding and applying a general pattern from a small number of examples. Evaluating deep neural architectures on this task could give insight into their potential limitations regarding reasoning and their broad generalisation abilities, yet this is currently an under-explored area. In this paper, we introduce a new benchmark for evaluating language models beyond memorization on abstract reasoning tasks. We perform extensive evaluations of state-of-the-art LLMs, showing that they achieve very limited performance in contrast with other natural language tasks, and we examine the reasons for this difference. We apply techniques that have been shown to improve performance on other NLP tasks and show that their impact on abstract reasoning is limited.
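As a rough illustration of the evaluation setup the abstract describes, the sketch below scores a model on a pattern-completion item: the model sees a few examples of a hidden rule and must apply it to a new input. The letter-string item, the `query_model` stub, and exact-match scoring are illustrative assumptions, not the benchmark or protocol from the paper itself.

```python
# Minimal sketch of an abstract-reasoning evaluation loop.
# Item format, model stub, and scoring are assumptions for
# illustration; they are not the paper's actual benchmark.

from typing import Callable

# Each item shows a few instances of a hidden rule, then a query.
ITEMS = [
    {
        "prompt": (
            "Find the pattern and complete the last line.\n"
            "a b c -> a b d\n"
            "i j k -> i j l\n"
            "x y z ->"
        ),
        # Assumes a wrap-around rule; real benchmarks define this precisely.
        "answer": "x y a",
    },
]

def query_model(prompt: str) -> str:
    """Placeholder for a call to the LLM under evaluation."""
    raise NotImplementedError("plug in the model API here")

def evaluate(model: Callable[[str], str]) -> float:
    """Exact-match accuracy over all items."""
    correct = 0
    for item in ITEMS:
        prediction = model(item["prompt"]).strip()
        correct += prediction == item["answer"]
    return correct / len(ITEMS)

if __name__ == "__main__":
    # Demo with a trivial constant-answer baseline; replace with a
    # real LLM call (via query_model) to run an actual evaluation.
    dummy_model = lambda prompt: "x y a"
    print(f"accuracy: {evaluate(dummy_model):.2%}")
```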
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2575