Track: Long Paper Track (up to 9 pages)
Keywords: Machine Learning, LLMs for Code Generation, Reliability, Execution-Based Benchmark
TL;DR: An execution-based benchmark for code generation tasks, with model findings drawn from the benchmark results.
Abstract: The integration of Large Language Models (LLMs) into Integrated Development Environments (IDEs) has become a focal point in modern software development. LLMs offer the potential to significantly augment developer productivity by serving as intelligent, chat-driven programming assistants, especially with the rise of LLM-driven coding agents. With these tools comes the need for safeguards and quality-assurance metrics for consumers. In this paper, we introduce the Copilot Evaluation Harness: a set of data and tools for evaluating LLM-guided coding across a range of programming scenarios and languages. We propose a system for measuring and understanding model behavior when LLMs are used as chat-based coding assistants or coding agents that is more robust than previous state-of-the-art evaluation metrics.
We design and compute both static and execution-based success metrics on a wide range of developer tasks, including documentation generation from code (doc), test case generation (test), and bug fixing (fix). In the chat scenario, we find that GPT-4o exhibits much lower prompt sensitivity than the other models. In the agentic scenario, we find that reasoning models are more inclined to generate one-shot solutions, even when given multiple turns and access to tool calling. We show how results from our metrics can be used to increase the interpretability and explainability of LLMs in the real-world IDE-chat scenario.
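To make the notion of an execution-based success metric concrete, the sketch below shows one minimal way such a metric could be computed: run the project's tests against model-generated code and count the fraction of candidates that pass. This is an illustrative assumption, not the harness described in the paper; the file name `solution.py`, the test command, and the helper names are hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path

def execution_success(generated_code: str, test_command: list[str]) -> bool:
    """Write model-generated code into a temporary workspace and run the
    given test command; success means the command exits with code 0.
    (Illustrative only: the file layout and test command are assumptions.)"""
    with tempfile.TemporaryDirectory() as workdir:
        # Hypothetical layout: the candidate code is saved as solution.py
        (Path(workdir) / "solution.py").write_text(generated_code)
        try:
            result = subprocess.run(
                test_command, cwd=workdir, capture_output=True, timeout=120
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0

def pass_rate(candidates: list[str], test_command: list[str]) -> float:
    """Fraction of model candidates whose generated code passes the tests."""
    if not candidates:
        return 0.0
    passed = sum(execution_success(c, test_command) for c in candidates)
    return passed / len(candidates)

# Example usage (hypothetical): score a batch of model outputs with pytest.
# score = pass_rate(model_samples, ["pytest", "-q"])
```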
Submission Number: 108