CodeStructEval: A Holistic Evaluation Framework of Code Structure Generation and Comprehension

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · Everyone · Revisions · BibTeX · CC BY 4.0
Keywords: large language models, evaluation, code structure comprehension, abstract syntax tree
Abstract: As Large Language Models (LLMs) rapidly evolve and demonstrate strong performance on software engineering tasks, a growing number of researchers are focusing on evaluating LLMs' code generation capabilities. Unlike previous benchmarks, which primarily evaluate LLMs' ability to generate sequential code from natural language requirements, we propose to assess their capabilities in generating and comprehending code structures. These two aspects represent a deeper, more fundamental understanding of program logic that better reflects a model's capacity for logical reasoning and structural awareness. Specifically, in this paper, we formally propose two tasks: CSG ($\textbf{C}$ode $\textbf{S}$tructure $\textbf{G}$eneration) and CSC ($\textbf{C}$ode $\textbf{S}$tructure $\textbf{C}$omprehension). The former requires LLMs to generate code structural information from given code, while the latter requires them to generate code from given code structural information. We then design a holistic evaluation framework, CodeStructEval, to assess LLMs' code structure generation and comprehension capabilities. This programming-language-agnostic evaluation framework has three main parts: 1) data preprocessing, 2) model inference, and 3) automated evaluation. For evaluation metrics, we introduce SAR ($\textbf{S}$emantic $\textbf{A}$ccuracy $\textbf{R}$ate) and StAR ($\textbf{St}$ructure $\textbf{A}$ccuracy $\textbf{R}$ate) to assess the quality of LLMs' outputs semantically and structurally, respectively. Using the CodeStructEval framework and the HumanEval seed dataset, we build a benchmark with 157 samples across three difficulty levels (Easy, Medium, Hard). Finally, we use this benchmark to thoroughly evaluate the code structure generation and comprehension abilities of 18 mainstream LLMs.
Our experimental results show that closed-source commercial LLMs demonstrate strong code structure generation and comprehension capabilities, while smaller open-source LLMs still have room for improvement.
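The abstract does not specify what "code structural information" looks like concretely. One plausible form, consistent with the keyword "abstract syntax tree," can be sketched with Python's built-in `ast` module; the function name `extract_structure` and the example snippet are illustrative assumptions, not taken from the benchmark itself:

```python
import ast

def extract_structure(source: str) -> str:
    """Parse Python source and return its AST dump, one possible
    representation of 'code structural information' for a CSG-style task
    (code -> structure). A CSC-style task would go the other way:
    structure -> code."""
    tree = ast.parse(source)
    return ast.dump(tree, indent=2)

snippet = "def add(a, b):\n    return a + b"
print(extract_structure(snippet))  # AST containing a FunctionDef node
```

Under this reading, SAR would measure whether generated code behaves correctly (e.g., via test execution), while StAR would measure whether its parsed structure matches the reference structure; the exact definitions are given in the paper.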
Primary Area: datasets and benchmarks
Submission Number: 23435