Keywords: Inductive Logic, Mathematical Reasoning, Overfitting Robustness, Benchmark
TL;DR: A new multi-turn inductive logic benchmark in which every test case starts from the same initial prompt; it reveals a sensible ordering of reasoning capability across models.
Abstract: While large language models (LLMs) have shown impressive capabilities across a wide range of domains, they still encounter significant challenges in reasoning tasks that require gathering evidence over multiple turns and drawing logical conclusions from that evidence. Despite the multi-turn nature of many real-world LLM use cases, most existing benchmarks rely on carefully curated single-turn tests, which often blur the line between memorization and genuine reasoning. To address this, we introduce the $\textbf{Wason Inductive Logic Test (WILT)}$, a simple yet challenging multi-turn reasoning benchmark designed to resist memorization. WILT is inspired by the Wason 2-4-6 task, in which participants must infer a basic Boolean function of three variables (e.g., $x < y < z$) by proposing test cases (such as $(2, 4, 6)$). In WILT, each test starts from a clean slate, with only the initial instructions provided, preventing models from relying on pre-learned responses. Our findings reveal that LLMs struggle with this task, with the best-performing model achieving only 28% accuracy, highlighting a significant gap in LLM performance on complex multi-turn reasoning tasks.
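To make the task structure concrete, below is a minimal sketch of a WILT-style interaction loop, not the authors' released code: a hidden Boolean rule over three variables, a fixed query budget, and a scripted stand-in for the model that proposes test cases and observes True/False feedback. The specific rule, probe triples, and turn limit are illustrative assumptions.

```python
# Hypothetical sketch of a Wason 2-4-6 / WILT-style episode (not the official benchmark code).
from typing import Callable, List, Tuple

Triple = Tuple[float, float, float]

def hidden_rule(x: float, y: float, z: float) -> bool:
    """Example secret Boolean function over three variables: x < y < z."""
    return x < y < z

def run_episode(
    rule: Callable[[float, float, float], bool],
    propose: Callable[[List[Tuple[Triple, bool]]], Triple],
    max_turns: int = 5,  # assumed query budget for illustration
) -> List[Tuple[Triple, bool]]:
    """Multi-turn loop: the solver proposes test cases and observes labels."""
    history: List[Tuple[Triple, bool]] = []
    for _ in range(max_turns):
        triple = propose(history)   # solver picks the next test case
        label = rule(*triple)       # environment reveals True/False
        history.append((triple, label))
    return history

def scripted_propose(history: List[Tuple[Triple, bool]]) -> Triple:
    """Toy stand-in for an LLM: cycles through a few informative probes."""
    probes: List[Triple] = [(2, 4, 6), (6, 4, 2), (1, 1, 1), (1, 2, 2), (-3, 0, 5)]
    return probes[len(history) % len(probes)]

if __name__ == "__main__":
    for triple, label in run_episode(hidden_rule, scripted_propose):
        print(triple, "->", label)
```

In the benchmark itself, the proposer would be an LLM queried over multiple turns before committing to a final hypothesis for the hidden rule.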
Concurrent Submissions: Submitted concurrently to ICLR 2025.
Submission Number: 9