Keywords: coding agents, software engineering benchmark, multi-lingual
TL;DR: A multi-lingual, repository-level benchmark for evaluating software engineering agents
Abstract: Coding agents powered by large language models have shown impressive capabilities in software engineering tasks, but evaluating their performance across diverse programming languages and real-world scenarios remains challenging. We introduce SWE-PolyBench, a new multi-language benchmark for repository-level, execution-based evaluation of coding agents. SWE-PolyBench contains 2110 instances from 21 repositories and includes tasks in Java, JavaScript, TypeScript, and Python, covering bug fixes, feature additions, and code refactoring. We provide a verified subset of 384 instances (SWE-PolyBench_Verified), a task- and repository-stratified subsample of 500 instances (SWE-PolyBench500), and release an evaluation harness allowing for fully automated evaluation. We further introduce novel instance stratifications and retrieval metrics rooted in syntax tree analysis to deepen the understanding of coding agent performance. Our experiments with leading open-source coding agents on SWE-PolyBench show that current agents exhibit uneven performance across languages and struggle with complex problems, while achieving higher success rates on simpler tasks. SWE-PolyBench aims to drive progress in developing more versatile and robust AI coding assistants for real-world software engineering.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 13410