Keywords: Code Generation, Benchmarking, Large Language Models, Software Engineering, Test Generation, Code Review
TL;DR: A holistic benchmark for LLM-based software engineering agents consisting of bug fixing, test generation, style fixing, and addressing code reviews.
Abstract: LLM-powered coding agents are redefining how real-world software is developed. To drive research towards better coding agents, we require challenging benchmarks that can rigorously evaluate the ability of such agents to perform a variety of software engineering tasks. However, popular coding benchmarks such as HumanEval and SWE-Bench focus on narrowly scoped tasks such as competition programming and patch generation. In reality, software engineers must handle a much broader set of tasks in real-world software development. To address this gap, we propose OmniCode, a novel software engineering benchmark that contains a diverse set of task categories, including responding to code reviews, test generation, fixing style violations, and program repair. Overall, OmniCode contains 2,912 Python tasks, 728 per category.
In contrast to prior software engineering benchmarks, the tasks in OmniCode are (1) manually validated to eliminate ill-defined problems, and (2) synthetically crafted or recently curated to avoid data leakage, presenting a new framework for synthetically generating diverse software tasks from limited real-world data. We evaluate OmniCode with popular agent frameworks such as SWE-Agent, demonstrating shortcomings on tasks that differ from bug fixing. For instance, SWE-Agent with Gemini 2.5 Flash obtains 14.0% on test generation and 8.1% on fixing style issues. With OmniCode, we aim to spur the development of agents that perform well across a broader spectrum of software development processes.
Primary Area: datasets and benchmarks
Submission Number: 21476