Do LLMs Really Understand Code? A Semantic Benchmark with Automated Question Generation and Evaluation
Keywords: Code Semantics Evaluation, Large Language Models (LLMs), Benchmark Design, Automated Question Generation
TL;DR: We introduce a novel benchmark for directly assessing how well large language models understand the semantics of C programs, with six tasks framed as yes/no questions automatically generated from source files.
Abstract: Large language models (LLMs) have demonstrated impressive results on code generation tasks, yet it remains unclear to what extent they genuinely understand code semantics and whether this affects their ability to write high-quality code. To address this question, we introduce \textbf{SemBench}, a novel benchmark consisting of \textbf{1,000} diverse C programs sourced from the CodeParrot GitHub-code dataset, with \textbf{15,404} semantic questions spanning six fundamental properties: function reachability, loop reachability, data dependency, variable liveness, dominator sets, and dead code. These six concepts are taught in undergraduate programming language courses and can be computed precisely and efficiently by deterministic algorithms. In contrast to existing benchmarks (e.g., HumanEval, MBPP, CodeXGLUE, SWE-bench) that emphasize code generation or functional correctness, our benchmark focuses on semantic understanding with deterministic answers. We evaluate \textbf{14} popular LLMs across \textbf{7} families, including GPT-4o Mini, GPT-3.5 Turbo, DeepSeek-Coder, CodeLlama, Qwen, StarCoder, Mistral, and Phi. To our surprise, the models exhibit very high failure rates, ranging from \textbf{21.40\%} to \textbf{81.86\%}. Category-level analysis reveals a sharp split between “shallow” control-flow and “deep” data-flow reasoning, with different models excelling on different task types. SemBench rankings correlate strongly with HumanEval and MBPP (Spearman's \(\rho = 0.61/0.72\)), demonstrating its potential as an indicator of whether an LLM can produce high-quality code. In fact, further study shows that the LLMs under evaluation have difficulty understanding even their \textit{own generated code}. For example, DeepSeek-Coder-V2-Lite-Instruct fails to correctly identify variable liveness 58.23\% of the time. Overall, our experiments provide deeper insights into semantic understanding, reveal a substantial gap between semantic reasoning and code completion in modern LLMs, and open new opportunities for improving coding LLMs.
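To make the question-generation idea concrete, below is a minimal sketch (not the authors' released implementation) of how a deterministic analysis can back one of SemBench's yes/no tasks, using variable liveness as the example. The CFG encoding, block names, and question template are illustrative assumptions only.

```python
# Minimal sketch: compute variable liveness on a toy CFG with a standard
# backward data-flow fixpoint, then turn the result into a yes/no question
# with a deterministic ground-truth answer (as SemBench does for its tasks).
# All names and the CFG format here are hypothetical, for illustration only.

from typing import Dict, Set

# Toy control-flow graph: each block lists the variables it uses (reads before
# any write), the variables it defines, and its successor blocks.
CFG = {
    "entry": {"use": set(),  "def": {"x"}, "succ": ["loop"]},
    "loop":  {"use": {"x"},  "def": {"y"}, "succ": ["loop", "exit"]},
    "exit":  {"use": {"y"},  "def": set(), "succ": []},
}

def liveness(cfg: Dict[str, dict]) -> Dict[str, Set[str]]:
    """Backward fixpoint: live_in[B] = use[B] ∪ (live_out[B] − def[B])."""
    live_in = {b: set() for b in cfg}
    live_out = {b: set() for b in cfg}
    changed = True
    while changed:
        changed = False
        for b, info in cfg.items():
            out = set().union(*(live_in[s] for s in info["succ"]))
            inn = info["use"] | (out - info["def"])
            if out != live_out[b] or inn != live_in[b]:
                live_out[b], live_in[b] = out, inn
                changed = True
    return live_in

def make_question(var: str, block: str, live_in: Dict[str, Set[str]]):
    """Turn the computed fact into a yes/no question plus its ground-truth answer."""
    question = f"Is variable '{var}' live at the entry of block '{block}'?"
    answer = "yes" if var in live_in[block] else "no"
    return question, answer

if __name__ == "__main__":
    live_in = liveness(CFG)
    print(make_question("x", "loop", live_in))   # answer: 'yes'
    print(make_question("y", "entry", live_in))  # answer: 'no'
```

An LLM's answer to such a question can then be graded automatically by exact comparison against the computed ground truth, with no human annotation or test execution required.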
Supplementary Material: pdf
Primary Area: datasets and benchmarks
Submission Number: 21976