Agentified Benchmarking for Logical Reasoning Agents

Published: 01 Mar 2026 · Last Modified: 24 Apr 2026 · ICLR 2026 AIWILD · CC BY 4.0
Keywords: benchmarking, agentified assessment, logical reasoning
Abstract: We present a framework for evaluating and benchmarking logical reasoning agents when assessment itself must be reproducible, auditable, and robust to execution failures. Building on agentified assessment, we use an assessor agent to issue tasks, enforce execution budgets, parse outputs, and record structured failure types, while the agent under test only needs to expose a standardized agent-to-agent interface. As a case study, we benchmark an auto-formalization agent for first-order logic (FOL) reasoning on a refined split of FOLIO that is solver-verified, repaired, and fully reviewed. The agent translates natural language premises and conclusions into executable Z3Py programs and uses satisfiability modulo theories (SMT) solving to determine logical entailment. On the refined FOLIO validation set, the auto-formalization agent achieves 86.21\% accuracy under the assessor protocol, outperforming a chain-of-thought baseline (71.92\%). Recomputing against the original FOLIO validation labels yields 85.71\% and 71.43\%, respectively; the original and refined validation splits contain identical premise-conclusion texts and differ on only three labels. Code is available at \url{https://github.com/zyni2001/agentified-logical-reasoning}.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 153