LogicEvolve: Advancing Logical Reasoning Toward Self-Evolution

17 Sept 2025 (modified: 26 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: logical reasoning, large language models, benchmark, automatic task generation, multi-agent framework, self-evolution
TL;DR: We present LogicEvolve, the first multi-agent framework for autonomously generating and evolving logical reasoning tasks, and CLUB, a unified benchmark for systematically evaluating models’ reasoning capabilities.
Abstract: The rapid progress of large language models (LLMs) highlights the urgent need for continuously evolving benchmarks that keep pace with advancing model capabilities. Yet existing benchmarks often rely on one-off curation or fixed scripts, lacking scalability and long-term adaptability. To address this, we present LogicEvolve, a highly automated multi-agent framework that enables dynamic control over the structure, difficulty distribution, and scale of deterministic symbolic tasks with minimal human intervention. Building on LogicEvolve, we introduce CLUB (Complex Logical Unified Benchmark), which spans diverse task types, including string puzzles, grid reasoning, and card games, for systematic evaluation of logical reasoning. Experiments show that even state-of-the-art models such as Grok-4 and GPT-5 reach only ~55–56% accuracy across multiple independent evaluations, far below desirable levels, with clear weaknesses in certain subcategories. These findings underscore logical reasoning as a persistent, unsolved core challenge for LLMs. All code, data, and an interactive evaluation platform will be publicly released after the review period to ensure reproducibility and foster further research.
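As a hypothetical illustration of the kind of generation the abstract describes (not the authors' released code), the sketch below shows a seeded, difficulty-controlled generator for one deterministic symbolic task family: a string-transformation puzzle whose ground-truth answer is produced by construction, so correctness is checkable without human annotation. The function name make_string_puzzle and all parameters are assumptions made for this sketch.

```python
# Hypothetical sketch: seeded, difficulty-controlled generation of a
# deterministic string puzzle with a verifiable answer (illustrative only).
import random
import string

def make_string_puzzle(seed: int, difficulty: int) -> dict:
    """Generate one string-transformation puzzle with a known answer.

    A single difficulty knob scales both the string length and the
    number of transformation steps, shifting the task distribution.
    """
    rng = random.Random(seed)  # fixed seed -> reproducible task instance
    start = "".join(rng.choices(string.ascii_lowercase, k=4 + 2 * difficulty))
    s, steps = start, []
    for _ in range(difficulty):  # difficulty = number of chained steps
        op = rng.choice(["reverse", "rotate", "swap halves"])
        if op == "reverse":
            s = s[::-1]
        elif op == "rotate":
            k = rng.randrange(1, len(s))
            s = s[k:] + s[:k]
            op = f"rotate left by {k}"
        else:  # swap halves
            mid = len(s) // 2
            s = s[mid:] + s[:mid]
        steps.append(op)
    prompt = (f"Start with the string '{start}' and apply, in order: "
              + "; ".join(steps) + ". What is the final string?")
    return {"prompt": prompt, "answer": s}  # answer correct by construction
```

For example, make_string_puzzle(seed=0, difficulty=3) yields a reproducible prompt/answer pair; sweeping seeds scales the benchmark size, while raising difficulty shifts the difficulty distribution, which is one plausible way such knobs could be exposed to a controlling agent.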
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 9165