Abstract: Code generation has gained increasing attention as a task to automate software development by transforming high-level descriptions into executable code. While large language models (LLMs) are effective at generating code, their performance heavily relies on the quality of input prompts. Current prompt engineering methods involve manual effort in designing prompts, which can be time-consuming and yield inconsistent results, potentially constraining the efficacy of LLMs in practical applications. This paper introduces Prochemy, a novel approach for automatically and iteratively refining prompts to enhance code generation. Prochemy addresses the limitations of manual prompt engineering by automating the optimization process, ensuring prompt consistency during inference, and integrating with multi-agent systems. It iteratively refines prompts based on model performance and uses the optimized final prompt to improve consistency and reliability across tasks. We evaluate Prochemy on both natural language-based code generation and code translation tasks using three series of LLMs. Results show that combining Prochemy with existing approaches outperforms baseline prompting methods. It achieves improvements of 5.0% (GPT-3.5-Turbo) and 1.9% (GPT-4o) over zero-shot baselines on HumanEval. When combined with the state-of-the-art LDB, Prochemy + LDB outperforms the standalone methods by 1.2–1.8%. For code translation, Prochemy elevates GPT-4o’s performance on Java-to-Python (AVATAR) from 74.5 to 84.1 (+12.9%) and on Python-to-Java from 66.8 to 78.2 (+17.1%). Furthermore, even though the o1-mini model already integrates prompt engineering techniques, Prochemy continues to perform well with it, further validating its effectiveness in code generation and translation tasks. Additionally, Prochemy is designed to be plug-and-play, optimizing prompts with minimal human intervention and seamlessly bridging the gap between simple prompts and complex frameworks.
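To make the iterative refinement idea described above concrete, the following is a minimal sketch of a performance-guided prompt-optimization loop in the spirit of Prochemy, not the paper's actual implementation. The helper names (llm_generate, run_test_cases) and the selection strategy are illustrative assumptions.

```python
# Hedged sketch: iterative prompt refinement guided by pass rate on training tasks.
# llm_generate and run_test_cases are hypothetical callables supplied by the user;
# they are not part of the paper's released code.

def refine_prompt(seed_prompt, train_tasks, llm_generate, run_test_cases, iterations=5):
    """Iteratively rewrite a prompt, keeping the variant that scores best on training tasks."""

    def score(prompt):
        # Fraction of training tasks whose generated code passes its tests.
        passed = 0
        for task in train_tasks:
            code = llm_generate(prompt + "\n" + task["description"])
            if run_test_cases(code, task["tests"]):
                passed += 1
        return passed / len(train_tasks)

    best_prompt, best_score = seed_prompt, score(seed_prompt)
    for _ in range(iterations):
        # Ask the model itself to propose a refined instruction prompt,
        # conditioned on the current best one.
        candidate = llm_generate(
            "Rewrite the following instruction so that a model produces more "
            "correct code. Return only the rewritten instruction.\n\n" + best_prompt
        )
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best_prompt, best_score = candidate, candidate_score

    # The single optimized prompt is then reused unchanged at inference time.
    return best_prompt
```

The loop reflects the abstract's description that prompts are refined based on model performance and that one final optimized prompt is fixed for inference; the greedy keep-the-best selection here is an assumed simplification.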
External IDs: dblp:journals/tse/YeSWGLLL25