A Benchmark and Pair-Level 4PL-IRT Framework for Reliable Evaluation of LLM Reasoning

Leizhen Zhang; Sheng Chen

20 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0

Keywords: large language models, reasoning, evaluation, symbolic reasoning, benchmarks, item response theory

TL;DR: A novel IRT-based evaluation method to measure LLMs’ symbolic reasoning ability at pair and instance levels.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet reliably evaluating their reasoning ability, particularly in symbolic reasoning, remains an open challenge. In this work, we introduce a novel evaluation framework based on Item Response Theory (IRT), applied at both the pair level and the instance level, and compare its effectiveness against traditional metrics such as Accuracy, F1, and MCC. Through extensive experiments across multiple LLMs, we show that while conventional metrics provide limited and sometimes misleading signals, IRT-based measures, especially under the 4PL model at the pair level, offer more stable and reliable insights into the reasoning competence of LLMs. Our study further presents a new benchmark suite for symbolic reasoning, along with a principled methodology for its generation and evaluation. This framework not only highlights the shortcomings of standard metrics but also establishes IRT as a more trustworthy foundation for assessing the reasoning abilities of LLMs. We argue that such rigorous evaluation methods are essential for guiding the future development of LLMs toward robust reasoning performance.
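
For readers unfamiliar with the 4PL model named in the abstract, the sketch below shows the textbook four-parameter logistic item response function. It is not taken from the paper: the abstract does not give the authors' parameterization, and how they form pair-level items is not stated here, so that step is not modeled. The function and parameter names (`four_pl`, `theta`, `a`, `b`, `c`, `d`) are illustrative assumptions.

```python
import math


def _sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def four_pl(theta: float, a: float, b: float, c: float, d: float) -> float:
    """Textbook 4PL item response function (an assumption, not the paper's code).

    theta: latent ability of the respondent (here, an LLM)
    a:     item discrimination (slope)
    b:     item difficulty (location)
    c:     lower asymptote (success probability by guessing)
    d:     upper asymptote (1 - d is the slip probability)

    Returns P(correct | theta) = c + (d - c) * sigmoid(a * (theta - b)).
    """
    return c + (d - c) * _sigmoid(a * (theta - b))


# When ability equals difficulty, the probability sits midway between c and d.
p_mid = four_pl(theta=0.0, a=1.0, b=0.0, c=0.25, d=0.95)
```

Under this form, the response probability is bounded between `c` and `d` rather than between 0 and 1, which is what distinguishes 4PL from the simpler 1PL to 3PL variants.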

Primary Area: datasets and benchmarks

Submission Number: 23368
