PerfCodeGen: Improving Performance of LLM Generated Code with Execution Feedback

Published: 2025 · Last Modified: 08 Jan 2026 · Forge@ICSE 2025 · CC BY-SA 4.0
Abstract: Large Language Models (LLMs) are widely adopted for assisting in software development tasks, yet their performance evaluations have narrowly focused on the functional correctness of generated code. Human programmers, however, expect AI assistants to generate not only correct but also optimally efficient code. We propose PerfCodeGen, a training-free framework that enhances the performance of LLM-generated code by incorporating runtime feedback from test case execution into its self-refinement iterations. With PerfCodeGen, we achieve speedups for a significantly higher proportion of problems compared to using the base LLM with sophisticated prompting techniques. Applied to open-weight language models like Phi-3-mini, PerfCodeGen achieves code optimization rates comparable to naive prompting of powerful closed models like GPT-4. With PerfCodeGen applied to GPT-3.5 and GPT-4, we achieve state-of-the-art code optimization on benchmarks such as HumanEval, MBPP, and APPS, frequently surpassing the ground-truth reference solutions. Additionally, we demonstrate the effectiveness of our approach in enhancing code quality across a range of open-weight LLMs of varying sizes, including Phi-3-mini (3.8B), Llama 3 8B, Mixtral 8x7B (13B active), Command R (35B), and Llama 3 70B. PerfCodeGen’s effectiveness at generating performant code underscores the importance of integrating execution feedback into the code generation process, highlighting a path forward for more robust and reliable AI-driven software development.
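To make the idea of runtime-feedback-driven self-refinement concrete, below is a minimal Python sketch of such a loop. It is an illustration only, not the paper's implementation: the query_llm stub, the prompt wording, and the run_tests helper are assumptions for this example, and the paper's exact prompts, feedback format, and models differ.

import time
from typing import Callable, List, Tuple


def query_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM (e.g., GPT-4 or an open-weight model).
    Returns a fixed candidate here so the sketch runs end to end."""
    return (
        "def solve(nums):\n"
        "    return sorted(nums)\n"
    )


def run_tests(code: str, tests: List[Tuple[list, list]]) -> Tuple[bool, float]:
    """Execute the candidate against unit tests and measure total runtime."""
    namespace: dict = {}
    exec(code, namespace)          # load the candidate into a fresh namespace
    solve: Callable = namespace["solve"]
    start = time.perf_counter()
    correct = all(solve(args) == expected for args, expected in tests)
    elapsed = time.perf_counter() - start
    return correct, elapsed


def refine(task: str, tests: List[Tuple[list, list]], rounds: int = 3) -> str:
    """Generate code, then iteratively ask the model to make it faster,
    feeding back the measured runtime of the current best candidate."""
    best_code = query_llm(f"Write a Python function `solve` for: {task}")
    _, best_time = run_tests(best_code, tests)
    for _ in range(rounds):
        feedback = (
            f"The following solution passes its tests in {best_time:.6f}s:\n"
            f"{best_code}\nRewrite it to run faster while staying correct."
        )
        candidate = query_llm(feedback)
        ok, t = run_tests(candidate, tests)
        if ok and t < best_time:   # keep only correct, faster candidates
            best_code, best_time = candidate, t
    return best_code


if __name__ == "__main__":
    tests = [([3, 1, 2], [1, 2, 3]), ([5, 4], [4, 5])]
    print(refine("sort a list of integers", tests))

The key design point the sketch illustrates is that candidates are accepted only if they remain functionally correct on the tests and measurably reduce runtime, which is the execution-feedback signal PerfCodeGen adds on top of ordinary self-refinement.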