Keywords: Large Language Models, Code Reasoning, Program Analysis, Code Understanding, Static Analysis, Benchmarking
TL;DR: CoRe: a high-quality, multi-lingual benchmark for evaluating LLMs’ Code Reasoning capabilities with fundamental static analysis tasks.
Abstract: Large language models (LLMs) have been widely adopted across diverse domains of software engineering, such as code generation, program repair, and vulnerability detection. These applications require understanding that goes beyond surface-level code patterns: value propagation, control flow, and interdependencies among program elements. However, existing benchmarks primarily evaluate end-to-end outcomes, such as whether code is correctly repaired or generated, leaving the models' ability to reason about program semantics underexplored.
This work presents CoRe, a high-quality, human-verified benchmark designed to evaluate LLMs on fundamental static analysis tasks. CoRe includes 12,553 task instances spanning data dependency, control dependency, and information flow across programs written in C/C++, Java, and Python.
To ensure semantic diversity and reasoning complexity, we propose a semantics-aware diverse sampling strategy that selects targets and task instances based on structural coverage and dependency depth.
We evaluate 10 state-of-the-art LLMs and show that, while they perform well at identifying dependencies, they still struggle with tasks that require deeper semantic understanding and multi-step reasoning.
We further conduct qualitative analyses to uncover key challenges, such as complex control structures and backward dependency patterns, offering insights into improving LLMs’ code reasoning capabilities.
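For illustration, the following hypothetical Python snippet (not an actual CoRe task instance) sketches the kind of questions these static analysis tasks pose: which definitions a value depends on, and which conditions govern whether a statement executes.

    # Illustrative example only; labels (A)-(D) are annotations for this sketch.
    def process(values, threshold):
        total = 0                      # (A) definition of `total`
        for v in values:
            if v > threshold:          # (B) branch condition
                total += v             # (C) data-dependent on (A); control-dependent on (B)
        return total                   # (D) returned value is data-dependent on (A) and (C)

    # A task instance might ask, for example: "Does the value returned at (D)
    # depend on the condition at (B)?", probing information flow rather than syntax.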
Croissant File: json
Dataset URL: https://huggingface.co/datasets/lt-asset/CoRe
Code URL: https://github.com/CoReBench/CoRe
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 2192