Keywords: Large Language Models, Code Reasoning, Program Analysis, Code Understanding, Static Analysis, Benchmarking
TL;DR: CoRe: a high-quality, multi-lingual benchmark for evaluating LLMs’ Code Reasoning capabilities with fundamental static analysis tasks.
Abstract: Large language models (LLMs) have been widely adopted across diverse domains of software engineering, such as code generation, program repair, and vulnerability detection. These applications require understanding that goes beyond surface-level code patterns: value propagation, control flow, and interdependencies among program elements. However, existing benchmarks primarily evaluate end-to-end outcomes, such as whether code is correctly repaired or generated, leaving the models' ability to reason about program semantics underexplored.
This work presents CoRe, a high-quality, human-verified benchmark designed to evaluate LLMs on fundamental static analysis tasks. CoRe includes 12,553 task instances spanning data dependency, control dependency, and information flow across programs written in C/C++, Java, and Python.
To ensure semantic diversity and reasoning complexity, we propose a semantics-aware diverse sampling strategy that selects targets and task instances based on structural coverage and dependency depth.
We evaluate 10 state-of-the-art LLMs and show that, while they perform well at identifying dependencies, they still struggle with tasks that require deeper semantic understanding and multi-step reasoning.
We further conduct qualitative analyses to uncover key challenges, such as complex control structures and backward dependency patterns, offering insights into improving LLMs’ code reasoning capabilities.
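For illustration, the following hypothetical Python snippet (not an actual CoRe task instance) sketches the kind of questions these static analysis tasks pose: which definitions a value depends on, and which conditions govern whether a statement executes.

    # Illustrative example only; labels (A)-(D) are annotations for this sketch.
    def process(values, threshold):
        total = 0                      # (A) definition of `total`
        for v in values:
            if v > threshold:          # (B) branch condition
                total += v             # (C) data-dependent on (A); control-dependent on (B)
        return total                   # (D) returned value is data-dependent on (A) and (C)

    # A task instance might ask, for example: "Does the value returned at (D)
    # depend on the condition at (B)?", probing information flow rather than syntax.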
Croissant File: json
Dataset URL: https://huggingface.co/datasets/lt-asset/CoRe
Code URL: https://github.com/CoReBench/CoRe
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 2192