CryptoX : Compositional Reasoning Evaluation of Large Language Models

ICLR 2026 Conference Submission12363 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: LLM, benchmark, compositional reasoning, crypto
TL;DR: CryptoX is a plug-in evaluation framework (with CryptoBench) that quantifies LLMs’ compositional reasoning, reveals large gaps across 40+ models, and pinpoints stage-wise skills that drive CR performance.
Abstract: Compositional reasoning (CR) has long been regarded as critical to the generalization and emergent intelligence of large language models (LLMs). However, despite numerous reasoning-related benchmarks, the compositional reasoning capacity of LLMs is rarely studied or quantified. In this paper, we introduce **CryptoX**, a plug-in evaluation framework that quantifies the compositional reasoning capacity of LLMs. Building upon CryptoX, we construct **CryptoBench**, which integrates atomic transformation rules from CryptoX into a set of relatively simple benchmarks; these serve as proxies for models' CR ability on more complex real-world problems. Using CryptoBench and a carefully designed metric, we conduct comprehensive experiments on 40+ widely used LLMs, which reveal clear disparities in their CR abilities. Through further analytical experiments, we demonstrate that CryptoX indeed evaluates models' true CR ability. Moreover, by analyzing open-source models with mechanistic interpretability methods, we find that the CR process exhibits a clear stage-wise structure: Subtask Decomposition, Subtask Solving, and Integration. Finally, through both formal analysis and experiments, we show that two of these stages, corresponding to Reasoning Path Planning Ability and Subtask Decomposition Ability, play a pivotal role in determining the effectiveness of the CR process.
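To make the "plug-in" idea concrete, here is a minimal sketch of how an atomic transformation rule could be composed with a simple benchmark item, so that answering requires two composed steps (decode, then solve). The function names and the choice of a Caesar shift as the atomic rule are illustrative assumptions, not the authors' actual API.

```python
# Hypothetical sketch: an atomic transformation (a Caesar shift) is
# applied to a plain benchmark question, producing a prompt whose
# solution demands compositional reasoning (decoding + solving).
# All names here are illustrative, not from the CryptoX codebase.

def caesar_encode(text: str, shift: int = 3) -> str:
    """Atomic transformation: shift each letter by `shift` positions."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('a') if ch.islower() else ord('A')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)  # digits and punctuation pass through
    return "".join(out)

def wrap_benchmark_item(question: str, shift: int = 3) -> str:
    """Compose the atomic rule with a benchmark question, yielding a
    two-stage task: first decode the text, then answer it."""
    encoded = caesar_encode(question, shift)
    return (
        f"The following question is Caesar-shifted by {shift}. "
        f"Decode it, then answer it.\n{encoded}"
    )

prompt = wrap_benchmark_item("What is the capital of France?")
print(prompt)
```

A negative shift reverses the transformation, so the mapping is invertible and the original benchmark answer remains the ground truth for scoring.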
Primary Area: datasets and benchmarks
Submission Number: 12363