Keywords: reasoning capability, llm reasoning, large language models, token bias, hypothesis testing, logical fallacy
TL;DR: This study proposes a hypothesis-testing framework to determine whether large language models possess genuine reasoning abilities or rely on token bias.
Abstract: This study proposes a hypothesis-testing framework to determine whether large language models (LLMs) possess genuine reasoning abilities or rely on token bias. Carefully controlled synthetic datasets are generated, and null hypotheses assuming LLMs' reasoning capabilities are tested with statistical guarantees. Inconsistent behavior across these experiments leads to rejection of the null hypotheses. Our findings, using the conjunction fallacy as a quintessential example, suggest that current LLMs still struggle with probabilistic reasoning, and that apparent performance improvements are largely attributable to token bias.
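To make the abstract's testing logic concrete, here is a minimal illustrative sketch (not the paper's actual procedure) of how such a null hypothesis could be rejected: assume a model answers a set of conjunction-fallacy problems correctly, then surface tokens (names, entities) are swapped while the logical structure is kept fixed. Under a null of genuine reasoning, the per-item success rate should stay near some tolerance `p0`; a sharp drop yields a small one-sided binomial p-value. All counts and the threshold below are hypothetical.

```python
from math import comb


def binom_p_value(k: int, n: int, p0: float) -> float:
    """One-sided exact binomial p-value: P(X <= k) for X ~ Bin(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k + 1))


# Hypothetical counts: of 100 problems the model solved in original form,
# only 62 remain correct after token perturbation with identical logic.
n, k = 100, 62
p0 = 0.95  # null hypothesis: genuine reasoning tolerates token swaps

p_value = binom_p_value(k, n, p0)
reject_null = p_value < 0.05  # rejection suggests reliance on token bias
```

With these illustrative numbers the p-value is vanishingly small, so the "genuine reasoning" null would be rejected, mirroring the abstract's conclusion that performance gains can stem from token bias rather than reasoning.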
Submission Number: 10