Keywords: Code Generation, Benchmarking, Large Language Models, Software Engineering, Test Generation, Code Review
TL;DR: A holistic benchmark for LLM-based software engineering agents consisting of bug fixing, test generation, style fixing, and addressing code reviews.
Abstract: LLM-powered coding agents are redefining how real-world software is developed. To drive research towards better coding agents, we require challenging benchmarks that can rigorously evaluate the ability of such agents to perform a variety of software engineering tasks. However, popular coding benchmarks such as HumanEval and SWE-Bench focus on narrowly scoped tasks such as competition programming and patch generation. In reality, software engineers must handle a much broader set of tasks in real-world software development. To address this gap, we propose OmniCode, a novel software engineering benchmark that contains a diverse set of task categories, including responding to code reviews, test generation, fixing style violations, and program repair. Overall, OmniCode contains 2,912 Python tasks, 728 per category.
In contrast to prior software engineering benchmarks, the tasks in OmniCode are (1) manually validated to eliminate ill-defined problems, and (2) synthetically crafted or recently curated to avoid data leakage, presenting a new framework for synthetically generating diverse software tasks from limited real-world data. We evaluate OmniCode with popular agent frameworks such as SWE-Agent, demonstrating shortcomings on tasks that differ from bug fixing. For instance, SWE-Agent with Gemini 2.5 Flash obtains 14.0% on test generation and 8.1% on fixing style issues. With OmniCode, we aim to spur the development of agents that perform well across a broader spectrum of software development processes.
Primary Area: datasets and benchmarks
Submission Number: 21476