Keywords: Inductive Logic, Mathematical Reasoning, Overfitting Robustness, Benchmark
TL;DR: A new multi-turn inductive logic benchmark in which every test case starts from the same initial prompt; it reveals a sensible ordering of reasoning capability across models.
Abstract: While large language models (LLMs) have shown impressive capabilities across a wide range of domains, they still encounter significant challenges in reasoning tasks that require gathering evidence over multiple turns and drawing logical conclusions from that evidence. Despite the multi-turn nature of many real-world LLM use cases, most existing benchmarks rely on carefully curated single-turn tests, which often blur the line between memorization and genuine reasoning. To address this, we introduce the $\textbf{Wason Inductive Logic Test (WILT)}$, a simple yet challenging multi-turn reasoning benchmark designed to resist memorization. WILT is inspired by the Wason 2-4-6 task, in which participants must infer a basic Boolean function of three variables (e.g., $x < y < z$) by proposing test cases (such as $(2, 4, 6)$). In WILT, each test starts from a clean slate, with only the initial instructions provided, preventing models from relying on pre-learned responses. Our findings reveal that LLMs struggle with this task, with the best-performing model achieving only 28% accuracy, highlighting a significant gap in LLM performance on complex multi-turn reasoning tasks.
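To make the task structure concrete, below is a minimal sketch of a WILT-style interaction loop, not the authors' released code: a hidden Boolean rule over three variables, a fixed query budget, and a scripted stand-in for the model that proposes test cases and observes True/False feedback. The specific rule, probe triples, and turn limit are illustrative assumptions.

```python
# Hypothetical sketch of a Wason 2-4-6 / WILT-style episode (not the official benchmark code).
from typing import Callable, List, Tuple

Triple = Tuple[float, float, float]

def hidden_rule(x: float, y: float, z: float) -> bool:
    """Example secret Boolean function over three variables: x < y < z."""
    return x < y < z

def run_episode(
    rule: Callable[[float, float, float], bool],
    propose: Callable[[List[Tuple[Triple, bool]]], Triple],
    max_turns: int = 5,  # assumed query budget for illustration
) -> List[Tuple[Triple, bool]]:
    """Multi-turn loop: the solver proposes test cases and observes labels."""
    history: List[Tuple[Triple, bool]] = []
    for _ in range(max_turns):
        triple = propose(history)   # solver picks the next test case
        label = rule(*triple)       # environment reveals True/False
        history.append((triple, label))
    return history

def scripted_propose(history: List[Tuple[Triple, bool]]) -> Triple:
    """Toy stand-in for an LLM: cycles through a few informative probes."""
    probes: List[Triple] = [(2, 4, 6), (6, 4, 2), (1, 1, 1), (1, 2, 2), (-3, 0, 5)]
    return probes[len(history) % len(probes)]

if __name__ == "__main__":
    for triple, label in run_episode(hidden_rule, scripted_propose):
        print(triple, "->", label)
```

In the benchmark itself, the proposer would be an LLM queried over multiple turns before committing to a final hypothesis for the hidden rule.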
Concurrent Submissions: Submitted concurrently to ICLR 2025.
Submission Number: 9