Performance analysis of AI-generated code: A case study of Copilot, Copilot Chat, CodeLlama, and DeepSeek-Coder models
Abstract: The integration of Large Language Models (LLMs) into software development tools like GitHub Copilot and Copilot Chat, along with the advancement of code generation models like DeepSeek-Coder and CodeLlama, holds the promise of transforming code generation processes. While AI-driven code generation presents numerous advantages for software development, code generated by LLMs may introduce challenges related to security, privacy, and copyright. However, the performance implications of AI-generated code remain insufficiently explored. This study conducts an empirical analysis focusing on the performance regressions of code generated by GitHub Copilot, Copilot Chat, CodeLlama, and DeepSeek-Coder across four distinct datasets: HumanEval, AixBench, MBPP, and the performance-oriented benchmark EvalPerf. We adopt a comprehensive methodology encompassing static and dynamic performance analyses to assess the effectiveness of the generated code. Our findings reveal that although the generated code is functionally correct, it frequently exhibits performance regressions compared to code solutions crafted by humans. We further investigate the code-level root causes responsible for these performance regressions. We identify four major root causes, i.e., inefficient function calls, inefficient looping, inefficient algorithms, and inefficient use of language features. We further identify a total of eleven sub-categories of root causes attributed to the performance regressions of generated code. Additionally, we explore prompt engineering, including few-shot and Chain-of-Thought (CoT) prompting, as a potential strategy for optimizing performance. The outcomes demonstrate that few-shot prompting, grounded in identified root causes of code performance regressions, can improve the performance of generated code by guiding models toward performance-oriented generation.
In contrast, CoT prompting proves less effective, and in some cases detrimental, suggesting that reasoning-oriented strategies do not necessarily enhance performance. Across both general-purpose and efficiency-oriented benchmarks, our analysis reveals that performance regressions persist regardless of dataset scope, underscoring the necessity of treating performance as a first-class dimension of code quality. This research provides valuable insights that contribute to a more comprehensive understanding of AI-assisted code generation.
External IDs:dblp:journals/ese/LiCCXHS26