TL;DR: Copilot Arena is a platform for realistic evaluation of code LLMs, collecting human preferences for coding models from real users, on real tasks, in real development environments.
Abstract: Evaluating the in-the-wild coding capabilities of large language models (LLMs) is a challenging endeavor with no existing solution. We introduce Copilot Arena, a platform that collects user preferences through native integration into a developer's working environment. Copilot Arena comprises a novel interface for comparing pairs of model outputs, a sampling strategy to reduce experienced latency, and a prompting scheme to enable code completion functionality. Copilot Arena has served over 4.5 million suggestions from 10 models and collected over 11k pairwise judgements. Our results highlight the importance of evaluating models in integrated settings. We find that model rankings from Copilot Arena differ from those of existing evaluations, which we attribute to the unique distribution of data and tasks contained in Copilot Arena. We also identify novel insights into human preferences on code, such as consistent user preferences across programming languages but significant variation in preference across task categories. We open-source Copilot Arena and release its data to enable human-centric evaluations and improve understanding of coding assistants.
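To make the sampling component concrete, the minimal sketch below shows one way to do latency-weighted pair selection: slower models are drawn less often so that the latency a user experiences (roughly the slower of the two completions shown) stays low on average. The model names, latency values, and the exponent gamma are illustrative assumptions for this sketch, not the exact scheme deployed in Copilot Arena.

    import random

    # Hypothetical median latencies (seconds) per model; in practice these
    # would be measured from served completions.
    MEDIAN_LATENCY = {"model-a": 0.8, "model-b": 1.4, "model-c": 2.1}

    def sample_model_pair(latencies, gamma=1.0):
        """Sample two distinct models, down-weighting slower ones so that
        the experienced latency of the pair stays low on average."""
        models = list(latencies)
        # Weight each model inversely to its latency; gamma controls how
        # strongly slow models are penalized.
        weights = [latencies[m] ** -gamma for m in models]
        first = random.choices(models, weights=weights, k=1)[0]
        rest = [m for m in models if m != first]
        rest_weights = [latencies[m] ** -gamma for m in rest]
        second = random.choices(rest, weights=rest_weights, k=1)[0]
        return first, second

    print(sample_model_pair(MEDIAN_LATENCY))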
Lay Summary: We developed a tool called Copilot Arena to better understand how people use code suggestions from large language models. Traditional ways of testing these models do not fully reflect how developers work in real environments, so we created a system that integrates directly into coding tools, making it easier to collect real user feedback.
Copilot Arena shows users two different code suggestions and asks which one they prefer. It also uses techniques to deliver suggestions quickly and to format them as proper code completions. So far, it has delivered over 4.5 million suggestions from 10 different AI models and gathered more than 11,000 user votes.
Our findings show that user preferences for model suggestions can vary depending on the task but are often consistent across programming languages. Interestingly, some top-performing models in prior tests do not always perform best in real-world coding. We’ve made Copilot Arena and its data public to help others build better AI coding tools.
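The model comparisons above come from pairwise votes. One standard way to turn such votes into a ranking is a Bradley-Terry model; the sketch below, with made-up vote counts, illustrates the idea and is not necessarily the exact fitting procedure used for Copilot Arena's results.

    from collections import defaultdict

    # Hypothetical pairwise win counts: wins[(a, b)] = number of times model a
    # was preferred over model b. Real counts would come from user votes.
    wins = {
        ("model-a", "model-b"): 70, ("model-b", "model-a"): 30,
        ("model-a", "model-c"): 55, ("model-c", "model-a"): 45,
        ("model-b", "model-c"): 60, ("model-c", "model-b"): 40,
    }

    def bradley_terry(wins, iters=200):
        """Estimate Bradley-Terry strengths with the standard MM updates."""
        models = sorted({m for pair in wins for m in pair})
        strength = {m: 1.0 for m in models}
        total_wins = defaultdict(float)
        for (a, b), n in wins.items():
            total_wins[a] += n
        for _ in range(iters):
            new = {}
            for m in models:
                denom = 0.0
                for opp in models:
                    if opp == m:
                        continue
                    n_games = wins.get((m, opp), 0) + wins.get((opp, m), 0)
                    if n_games:
                        denom += n_games / (strength[m] + strength[opp])
                new[m] = total_wins[m] / denom if denom else strength[m]
            # Normalize so strengths are comparable across iterations.
            norm = sum(new.values())
            strength = {m: s / norm for m, s in new.items()}
        return strength

    # Models with higher estimated strength were preferred more often overall.
    ranking = sorted(bradley_terry(wins).items(), key=lambda kv: -kv[1])
    print(ranking)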
Link To Code: https://github.com/lmarena/copilot-arena
Primary Area: General Machine Learning->Evaluation
Keywords: evaluation, code, llm
Submission Number: 5145