CoCoNUT: Structural Code Understanding does not fall out of a tree

Published: 2025, Last Modified: 25 Jan 2026, Venue: LLM4Code@ICSE 2025, License: CC BY-SA 4.0
Abstract: Large Language Models (LLMs) have shown impressive performance across a wide array of tasks involving both structured and unstructured textual data. More recently, LLMs have demonstrated remarkable abilities to generate code across different programming languages. Recent results on various benchmarks for code generation, repair, or completion suggest that certain models have programming abilities comparable to, or even surpassing, those of humans. In this work, we demonstrate that high performance on such benchmarks does not correlate with humans' innate ability to understand the structural control flow of code. For this purpose, we extract code solutions from the HumanEval benchmark, on which the relevant models perform very strongly, and trace their execution paths using function calls sampled from the respective test set. Using this dataset, we investigate the ability of 7 state-of-the-art LLMs to match the execution trace and find that, despite the models' ability to generate semantically identical code, they possess only a limited ability to trace the execution path, especially for longer traces and specific control structures. We find that even the top-performing model, Gemini 1.5 Pro, can generate a fully correct trace for only 47% of HumanEval tasks. In addition, we introduce a subset for three key structures that are absent from HumanEval or contained in it only to a limited extent: Recursion, Parallel Processing, and Object-Oriented Programming (OOP) principles, including concepts like Inheritance and Polymorphism. We show that, except for OOP, none of the investigated models achieves an average accuracy of over 5% on the relevant traces. Aggregating these specialized parts with the ubiquitous HumanEval tasks, we present the benchmark CoCoNUT: Code Control Flow for Navigation Understanding and Testing, which measures a model's ability to trace the execution of code upon relevant calls, including advanced structural components.
We conclude that current-generation LLMs still need to significantly improve their code reasoning abilities. We hope our dataset can help researchers bridge this gap in the near future.
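The idea of tracing a solution's execution path for a given test call, and then checking whether a model's predicted trace is fully correct, can be illustrated with a minimal sketch. It makes several assumptions that the abstract does not state: that a trace is the ordered list of executed line numbers (counted from the `def` line), that a prediction counts as correct only on exact match, and that Python's standard `sys.settrace` hook is an acceptable way to record the reference trace. The names `trace_lines`, `exact_match`, and `sum_positive` are hypothetical and not taken from the benchmark.

```python
import sys


def trace_lines(func, *args, **kwargs):
    """Run func and return (result, executed line numbers relative to its def line).

    Assumption: a "trace" is the ordered list of executed source lines.
    The benchmark's actual trace format may differ.
    """
    code = func.__code__
    first_line = code.co_firstlineno
    executed = []

    def tracer(frame, event, arg):
        # Ignore frames that do not execute the target function's code.
        if frame.f_code is not code:
            return None
        if event == "line":
            executed.append(frame.f_lineno - first_line + 1)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        # Always restore whatever tracer was active before.
        sys.settrace(previous)
    return result, executed


def exact_match(predicted, reference):
    """Assumed scoring rule: a predicted trace counts only if it is fully correct."""
    return list(predicted) == list(reference)


# Hypothetical stand-in for a benchmark solution.
def sum_positive(values):
    total = 0
    for v in values:
        if v > 0:
            total += v
    return total


result, reference_trace = trace_lines(sum_positive, [1, -2])
print(result, reference_trace)
```

A model's answer would then be parsed into a list of line numbers and passed to `exact_match` together with `reference_trace`. Because a single wrong or missing step makes the whole trace incorrect under this rule, longer traces are harder to get fully right.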