Track: Long Paper Track (up to 9 pages)
Keywords: Machine Learning, LLMs for Code Generation, Reliability, Execution-Based Benchmark
TL;DR: An execution-based benchmark for code generation tasks, with model findings drawn from the benchmark results.
Abstract: The integration of Large Language Models (LLMs) into Integrated Development Environments (IDEs) has become a focal point in modern software development. LLMs offer the potential to significantly augment developer productivity by serving as intelligent, chat-driven programming assistants, especially with the rise of LLM-driven coding agents. With these tools comes the need for safeguards and quality-assurance metrics for consumers. In this paper, we introduce the Copilot Evaluation Harness: a set of data and tools for evaluating LLM-guided coding across a range of programming scenarios and languages. We propose a system for measuring and understanding model behavior when LLMs are used as chat-based coding assistants or coding agents that is more robust than previous state-of-the-art evaluation metrics.
We design and compute both static and execution-based success metrics on a wide range of developer tasks, including documentation generation from code (doc), test case generation (test), and bug fixing (fix). In the chat scenario, we find that GPT-4o exhibits much lower prompt sensitivity than the other models. In the agentic scenario, we find that reasoning models are more inclined to generate one-shot solutions, even when given multiple turns and access to tool calling. We show how results from our metrics can be used to increase the interpretability and explainability of LLMs in the real-world IDE-chat scenario.
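To make the notion of an execution-based success metric concrete, the sketch below shows one minimal way such a metric could be computed: run the project's tests against model-generated code and count the fraction of candidates that pass. This is an illustrative assumption, not the harness described in the paper; the file name `solution.py`, the test command, and the helper names are hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path

def execution_success(generated_code: str, test_command: list[str]) -> bool:
    """Write model-generated code into a temporary workspace and run the
    given test command; success means the command exits with code 0.
    (Illustrative only: the file layout and test command are assumptions.)"""
    with tempfile.TemporaryDirectory() as workdir:
        # Hypothetical layout: the candidate code is saved as solution.py
        (Path(workdir) / "solution.py").write_text(generated_code)
        try:
            result = subprocess.run(
                test_command, cwd=workdir, capture_output=True, timeout=120
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0

def pass_rate(candidates: list[str], test_command: list[str]) -> float:
    """Fraction of model candidates whose generated code passes the tests."""
    if not candidates:
        return 0.0
    passed = sum(execution_success(c, test_command) for c in candidates)
    return passed / len(candidates)

# Example usage (hypothetical): score a batch of model outputs with pytest.
# score = pass_rate(model_samples, ["pytest", "-q"])
```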
Submission Number: 108