Programming with Pixels: Towards Generalist Software Engineering Agents

Published: 06 Mar 2025, Last Modified: 06 Mar 2025DL4C @ ICLR 2025EveryoneRevisionsBibTeXCC BY 4.0
Track: long paper (up to 9 pages)
Keywords: computer-use, software-engineering, code-generation, multimodal
TL;DR: Programming with Pixels (PwP) is an IDE-based environment where AI agents use basic computer interactions (typing, clicking) rather than predefined and hand-engineered APIs, enabling the path toward truly general-purpose software engineering agents.
Abstract:

Recent advancements in software engineering (SWE) agents have largely followed a $\textit{tool-based paradigm}$, where agents interact with hand-engineered tool APIs to perform specific tasks. While effective for specialized tasks, these methods fundamentally lack generalization, as they require predefined tools for each task and do not scale across programming languages and domains. We introduce $\textbf{Programming with Pixels}$ (PwP), an agent environment that unifies software development tasks by enabling $\textit{computer-use agents}$—agents that operate directly within an IDE through visual perception, typing, and clicking, rather than relying on predefined tool APIs. To systematically evaluate these agents, we propose \bench{}, a benchmark that unifies existing SWE benchmarks spanning tasks across multiple programming languages, modalities, and domains under a task-agnostic state and action space. Our experiments demonstrate that general-purpose computer-use agents can approach or even surpass specialized tool-based agents on various SWE tasks without the need for hand-engineered tools. However, our analysis shows that current models suffer from limited visual grounding and fail to exploit many IDE tools that could simplify their tasks. When agents can directly access IDE tools, without visual interaction, they show significant performance improvements, highlighting the untapped potential of leveraging built-in IDE capabilities. Our results establish PwP as a scalable testbed for building and evaluating future computer-use SWE agents that interact directly with development environments.

Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Presenter: Pranjal Aggarwal
Format: Maybe: the presenting author will attend in person, contingent on other factors that still need to be determined (e.g., visa, funding).
Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.
Submission Number: 52
Loading

OpenReview is a long-term project to advance science through improved peer review with legal nonprofit status. We gratefully acknowledge the support of the OpenReview Sponsors. © 2025 OpenReview