DoReMi - Difficulty-Oriented Reasoning Effort Modeling of Science Problems for Language Models

20 Sept 2025 (modified: 07 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Evaluation, Difficulty analysis, Reasoning models, Reasoning effort, Scientific benchmarks, LLMs
TL;DR: Analysing how Bloom’s taxonomy-based difficulty metrics help measure the reasoning effort required by LLMs on science benchmarks, revealing key differences across model generations and offering guidelines for evaluation and improvement.
Abstract: We introduce DoReMi (Difficulty-Oriented Reasoning Effort Modeling), a structured framework leveraging an extended Bloom's taxonomy to comprehensively characterize intrinsic problem difficulty for large language models on scientific reasoning tasks. DoReMi systematically annotates problems along seven cognitive and methodological axes using judge LLMs distinct from those being evaluated, with human annotations confirming the validity of these assessments. We empirically quantify LLM reasoning effort through metrics including the minimum number of reasoning tokens required for a solution and the expected number of trials to first success. Our validation demonstrates strong agreement across diverse judge LLMs spanning both open-source and proprietary models. Evaluations on GPQA, ARC, and SuperGPQA reveal that our multidimensional difficulty fingerprints correlate strongly with, and enable accurate predictive modeling of, LLM reasoning effort. DoReMi enables principled difficulty-aware subset selection that substantially outperforms static-difficulty baselines while providing interpretable diagnostics that uncover emergent reasoning capabilities across successive model generations. This framework offers actionable insights for benchmark design and targeted post-training improvements toward higher-order reasoning skills.
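As a rough illustration of the two reasoning-effort metrics named in the abstract (minimum reasoning tokens and expected trials to first success), the sketch below shows one way such quantities could be computed from repeated trials on a single problem. The function names, data layout, and the geometric-distribution estimator for expected trials are assumptions for illustration, not the paper's actual implementation.

```python
from typing import List, Optional

def min_reasoning_tokens(token_counts: List[int], correct: List[bool]) -> Optional[int]:
    """Smallest reasoning-token count among the trials that solved the problem.

    token_counts[i] is the number of reasoning tokens the model emitted on
    trial i; correct[i] marks whether that trial reached the right answer.
    Returns None if no trial succeeded.
    """
    successful = [t for t, ok in zip(token_counts, correct) if ok]
    return min(successful) if successful else None

def expected_trials_to_first_success(correct: List[bool]) -> Optional[float]:
    """Estimate E[trials until first success] as 1 / p_hat, where p_hat is
    the empirical per-trial success rate (a geometric-distribution estimate).
    Returns None when the problem was never solved in the sample.
    """
    p_hat = sum(correct) / len(correct)
    return 1.0 / p_hat if p_hat > 0 else None

# Hypothetical usage: five independent trials on one benchmark problem.
tokens = [812, 1540, 640, 990, 1210]
solved = [False, True, False, True, False]
print(min_reasoning_tokens(tokens, solved))        # 990
print(expected_trials_to_first_success(solved))    # 2.5 (p_hat = 0.4)
```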
Primary Area: foundation or frontier models, including LLMs
Submission Number: 24711