Scheherazade: Evaluating Chain-of-Thought Math Reasoning in LLMs with Chain-of-Problems

Published: 16 Oct 2025, Last Modified: 10 Nov 2025 · NeurIPS 2025 ER Workshop · CC BY 4.0
Keywords: Synthetic math reasoning data construction method.
TL;DR: We present Scheherazade, an automated approach that produces challenging mathematical reasoning benchmarks via two novel techniques for chaining problems together through logical implication.
Abstract: Benchmarks are critical for measuring Large Language Model (LLM) reasoning capabilities, and some have even become the de facto indicator of such capabilities. However, as LLM reasoning improves, existing widely-used benchmarks such as GSM8K capture only marginal differences between models: most state-of-the-art models achieve over 94% accuracy on the GSM8K dataset [gsm8k_res]. While constructing harder benchmarks is possible, their creation is often manual, expensive, and unscalable. We therefore present Scheherazade, an automated, scalable approach that produces challenging mathematical reasoning benchmarks by logically chaining a small starting set of problems. We propose two chaining methods, forward chaining and backward chaining, which include randomized branching techniques to generate complex reasoning problems. We apply Scheherazade to GSM8K to create GSM8K-Scheherazade and evaluate three frontier LLMs and OpenAI's o1 on it. While the other frontier models' performance declines precipitously after only a few problems are chained, our evaluation suggests o1's performance persists, and it is the only model to perform better at backward reasoning. We further develop an error taxonomy for GSM8K-Scheherazade, revealing error categories distinct from those observed in the original GSM8K. Our technique is scalable: it generates a large number of problems from a small starting set, providing a potential training resource of reasoning tasks with diverse complexities for enhancing LLM reasoning. Our data and code will be publicly available at https://github.com/YoshikiTakashima/scheherazade-code-data.
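To make the chaining idea concrete, here is a minimal sketch (not the authors' implementation; the function name, prompt wording, and example problems are illustrative assumptions) of forward chaining: the answer to one problem gates a follow-up problem via an "if the answer is X, then solve Y" implication, so a solver must get the first problem right before the second can be attempted correctly.

```python
def forward_chain(problem_a: str, answer_a: int, problem_b: str) -> str:
    """Chain problem B after problem A via logical implication.

    Illustrative sketch only: the paper's forward/backward chaining and
    randomized branching are richer than this single-link example.
    """
    return (
        f"{problem_a} If your answer to the previous question is "
        f"{answer_a}, solve the following; otherwise answer 0. {problem_b}"
    )


# Hypothetical GSM8K-style problems used purely for illustration.
chained = forward_chain(
    "Alice has 3 apples and buys 4 more. How many apples does she have?",
    7,
    "Bob has twice as many apples as Alice. How many apples does Bob have?",
)
print(chained)
```

Backward chaining would reverse the dependency direction, so that solving the final question requires reasoning back through the earlier links; branching inserts decoy implications whose conditions do not hold.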
Submission Number: 269