On the Paradox of Generalizable Logical Reasoning in Large Language Models

Xiaojuan Tang; Zilong Zheng; Jiaqi Li; Fanxu Meng; Song-Chun Zhu; Yitao Liang; Muhan Zhang

20 Sept 2023 (modified: 11 Feb 2024). Submitted to ICLR 2024.

Primary Area: representation learning for computer vision, audio, language, and other modalities

Keywords: Large Language Models; Symbolic Reasoning

Abstract: The emergent few-shot reasoning capabilities of Large Language Models (LLMs) have excited the natural language and machine learning communities in recent years. Despite numerous successful applications, it remains an open question whether LLMs have generalizable logical reasoning abilities. In this work, we expose a surprising failure of generalization in logical reasoning tasks (deduction, induction, and abduction): when semantics are decoupled from the language reasoning process (i.e., when semantic words are replaced with pure symbols), LLMs tend to perform much worse. We hypothesize that the learned semantics of language tokens do most of the heavy lifting during the reasoning process, and that LLMs fail to imitate the basic formal reasoning abilities of humans. We also fine-tune Llama-2 on purely symbolic reasoning tasks to narrow the gap. The results indicate that the fine-tuned model (FT-Llama2) can answer reasoning queries by matching similar templates, but it falls short of generalizing to novel logic rules. These observations question whether modern LLMs have mastered the inductive, deductive, and abductive reasoning abilities found in human intelligence, and they motivate research on understanding what happens inside black-box LLMs and on evaluating and improving language models' reasoning abilities.
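To make the idea of decoupling semantics concrete, here is a minimal sketch of replacing semantic words with pure symbols. It is only an illustration under stated assumptions: the function name `symbolize`, the `s1, s2, ...` symbol scheme, and the toy kinship facts are hypothetical and are not taken from the submission's datasets or code.

```python
import re

def symbolize(text, words):
    """Replace each listed semantic word with an abstract symbol (s1, s2, ...).

    Hypothetical helper for illustration; the paper's actual symbolization
    procedure is not described on this page.
    """
    mapping = {word: f"s{i}" for i, word in enumerate(words, start=1)}
    # Match whole words only, so "parent" is not replaced inside "grandparent".
    alternatives = "|".join(re.escape(w) for w in sorted(mapping, key=len, reverse=True))
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text), mapping

# Toy facts and query written with semantic names (assumed example).
facts = "parent(alice, bob). parent(bob, carol). grandparent(alice, carol)?"
symbolic, mapping = symbolize(facts, ["alice", "bob", "carol", "grandparent", "parent"])
print(symbolic)
```

The logical structure of the query is unchanged by the substitution; only the surface tokens that carry prior meaning are removed, which is the condition under which the abstract reports that LLM performance drops.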

Submission Number: 2413
