Keywords: coding agents, software engineering benchmark, multi-lingual
TL;DR: A multi-lingual, repository-level benchmark for evaluating software engineering agents
Abstract: Coding agents powered by large language models have shown impressive capabilities in software engineering tasks, but evaluating their performance across diverse programming languages and real-world scenarios remains challenging. We introduce SWE-PolyBench, a new multi-language benchmark for repository-level, execution-based evaluation of coding agents. SWE-PolyBench contains 2110 instances from 21 repositories and includes tasks in Java, JavaScript, TypeScript, and Python, covering bug fixes, feature additions, and code refactoring. We provide a verified subset of 384 instances (SWE-PolyBench_Verified), a task- and repository-stratified subsample of 500 instances (SWE-PolyBench500), and release an evaluation harness allowing for fully automated evaluation. We further introduce novel instance stratifications and retrieval metrics rooted in syntax tree analysis to deepen the understanding of coding agent performance. Our experiments with leading open-source coding agents on SWE-PolyBench show that current agents exhibit uneven performance across languages and struggle with complex problems, while achieving higher success rates on simpler tasks. SWE-PolyBench aims to drive progress in developing more versatile and robust AI coding assistants for real-world software engineering.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 13410