Abstract: Formal logic enables computers to reason over natural language by representing sentences in symbolic form and applying inference rules to derive conclusions. However, in what our study characterizes as "rulebreaker" scenarios, this method can lead to conclusions that humans, drawing on their common sense and factual knowledge, would typically not infer or accept. Inspired by work in cognitive science, we create RULEBREAKERS, the first dataset for rigorously evaluating the ability of large language models (LLMs) to recognize and respond to rulebreakers (versus non-rulebreakers) in a knowledge-informed and human-like manner. Evaluating seven LLMs, we find that most models achieve mediocre accuracy on RULEBREAKERS and exhibit some tendency to over-rigidly apply logical rules, in contrast to what is expected of typical human reasoners. Further analysis suggests that this apparent failure is potentially associated with the models' poor utilization of their world knowledge and their attention distribution patterns. Whilst revealing a limitation of current LLMs, our study also provides a timely counterbalance to a growing body of recent works that propose methods relying on formal logic to improve LLMs' general reasoning capabilities, highlighting the risk that such methods further widen the divergence between LLM reasoning and human-like reasoning.
Lay Summary: Humans use common sense and real-world knowledge in everyday reasoning. This flexibility sets us apart from systems that rely strictly on formal logic. For example, suppose we are told that "Anne is either in Stockholm or somewhere in Sweden" but then learn that "Anne is not in Sweden". Common sense tells us that she is not in Stockholm either, because Stockholm is in Sweden. By contrast, simply applying the logical rule "A or B; not B; therefore, A" would lead to the counterintuitive conclusion that "Anne is in Stockholm". We call such situations "rulebreakers": cases where rigidly applying logic leads to conclusions that contradict common sense and real-world knowledge.
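To make this example concrete, here is a minimal formalization; the propositional symbols S ("Anne is in Stockholm") and W ("Anne is in Sweden") are our own labels for illustration, not notation taken from the paper:

\[
\begin{aligned}
&\text{Disjunctive syllogism:} && S \lor W,\ \lnot W \;\vdash\; S && \text{("Anne is in Stockholm")}\\
&\text{World knowledge } (S \to W)\text{:} && S \to W,\ \lnot W \;\vdash\; \lnot S && \text{("Anne is not in Stockholm")}
\end{aligned}
\]

Taken together with the background fact S → W, the two stated premises are jointly inconsistent, which is why a typical human reasoner rejects the purely formal conclusion instead of accepting it.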
Do large language models (LLMs) reason flexibly like humans in these situations? To find out, we create a large dataset of rulebreakers to test these models. We find that most LLMs struggle to reason flexibly in these cases: they tend to over-rigidly apply logical rules and accept conclusions that contradict common sense and real-world knowledge. This highlights not only a shortcoming of current LLMs, but also an important pitfall in guiding them to reason more "logically", as doing so could further undermine their ability to reason in a flexible and human-like manner.
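As an illustration of the kind of contrast the dataset is built around, here is a minimal sketch in Python. The item format, field names, and the matched non-rulebreaker are hypothetical constructions for illustration only; they are not the released RULEBREAKERS schema or the authors' evaluation code.

# Hypothetical illustration only: a rulebreaker and a matched non-rulebreaker,
# turned into yes/no prompts of the kind one might pose to an LLM.
RULEBREAKER = {
    "premises": ["Anne is either in Stockholm or somewhere in Sweden.",
                 "Anne is not in Sweden."],
    "conclusion": "Anne is in Stockholm.",
    "humanlike_answer": "no",   # world knowledge: Stockholm is in Sweden
}

NON_RULEBREAKER = {
    "premises": ["Anne is either in Stockholm or somewhere in Norway.",
                 "Anne is not in Norway."],
    "conclusion": "Anne is in Stockholm.",
    "humanlike_answer": "yes",  # here the formal rule and world knowledge agree
}

def to_prompt(item: dict) -> str:
    """Format an item as a yes/no entailment question."""
    premises = " ".join(item["premises"])
    return (f"Premises: {premises}\n"
            f"Question: does it follow that \"{item['conclusion']}\"? Answer yes or no.")

if __name__ == "__main__":
    for name, item in [("rulebreaker", RULEBREAKER), ("non-rulebreaker", NON_RULEBREAKER)]:
        print(f"[{name}] expected (human-like) answer: {item['humanlike_answer']}")
        print(to_prompt(item))
        print()

A model that over-rigidly applies the formal rule would answer "yes" to both prompts, whereas a knowledge-informed, human-like reasoner answers "yes" only to the non-rulebreaker.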
Primary Area: Deep Learning->Large Language Models
Keywords: reasoning, logic, human-like reasoning, cognitive science, large language models
Submission Number: 13113