DSR-Bench: Evaluating the Structural Reasoning Abilities of LLMs via Data Structures

Yu He; Yingxi Li; Colin White; Ellen Vitercik

Published: 09 Jul 2025, Last Modified: 25 Jul 2025. Venue: AI4Math@ICML25 Poster. License: CC BY-NC-SA 4.0

Keywords: large language models, reasoning, math, benchmark, evaluation

TL;DR: We introduce a benchmark to evaluate LLMs’ structural reasoning ability: the capacity to construct, maintain, and reason about data structures ranging from simple to complex.

Abstract: Large language models (LLMs) are increasingly used in tasks involving complex mathematical and algorithmic reasoning. A core but often overlooked requirement across these tasks is the ability to perform structural reasoning, that is, to understand and reason about data relationships. For example, theorem proving requires maintaining a proof tree that organizes the hierarchical relationships among proof statements. However, existing benchmarks primarily focus on high-level, application-driven evaluations without isolating this fundamental capability. To address this gap, we introduce DSR-Bench, a novel benchmark evaluating LLMs' structural reasoning capabilities through data structures, which provide interpretable representations of data relationships. DSR-Bench includes 20 data structures, 35 operations, and 4,140 problem instances, organized hierarchically for fine-grained analysis of reasoning limitations. Our evaluation pipeline is fully automated and deterministic, eliminating subjective human or model-based judgments. We benchmark nine state-of-the-art LLMs, including some of the most advanced reasoning models. Our analysis shows that instruction-tuned models struggle with basic multi-attribute and multi-hop reasoning. Furthermore, while reasoning-oriented models perform better, they remain fragile on complex and hybrid structures, with the best model achieving an average score of only 47% on the challenge subset. Crucially, models often perform poorly on multi-dimensional data and natural language task descriptions, highlighting a critical gap for real-world deployment.

Submission Number: 103
