STORM-BORN: A Challenging Mathematical Derivations Dataset Curated via a Human-in-the-Loop Multi-Agent Framework

ACL ARR 2025 February Submission 7370 Authors

16 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0
Abstract: High-quality math datasets are essential for advancing the reasoning capabilities of large language models (LLMs). However, current datasets face three major issues: (i) outdated and insufficiently challenging content that fails to keep pace with the rapid advancement of LLMs, (ii) an overemphasis on strict step-by-step derivations that neglects human-like reasoning, and (iii) limited reliability resulting from single-agent synthetic generation. To address these challenges, we introduce STORM-BORN, a dataset of challenging mathematical derivations drawn from the latest and most influential academic papers. Unlike conventional numerical reasoning or formalized proofs, STORM-BORN focuses on natural language mathematical derivations that include dense human-like approximations and heuristic cues. To ensure the reliability and quality of the dataset, we propose a novel human-in-the-loop, multi-agent data generation framework that integrates reasoning-dense filters, multi-agent collaboration, and evaluations by human mathematicians. We curate a set of 2,000 synthetic samples, from which human experts select the 100 most challenging, high-quality problems. Empirical evaluations reveal that state-of-the-art AI models, such as GPT-o1, solve fewer than 5% of the STORM-BORN problems, underscoring the dataset's inherent difficulty. As AI approaches mathematician-level reasoning, STORM-BORN offers a novel, challenging, and reliable resource for training models to mimic human-like reasoning, and it also serves as a high-difficulty evaluation benchmark.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: mathematical dataset, large language models, natural language generation
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English, Chinese
Submission Number: 7370