Keywords: Agent, Benchmark, Compilation, LLM
Abstract: Automatically compiling open-source software (OSS) projects is a vital, labor-intensive, and complex task, making it a compelling challenge for LLM agents.
Existing methods rely on manually curated rules and workflows, which cannot
adapt to OSS that requires customized configuration or environment setup. Recent
attempts using Large Language Models (LLMs) have been evaluated only on selected
subsets of highly rated OSS, a practice that underestimates the realistic challenges
of OSS compilation. In practice, compilation instructions are often absent, dependencies
are undocumented, and successful builds may even require patching source files or
modifying build scripts. We propose a more challenging and realistic benchmark,
BUILD-BENCH, comprising OSS projects that are more diverse in quality,
scale, and characteristics. Furthermore, we propose a strong baseline LLM-based
agent, OSS-BUILD-AGENT, an effective system with an enhanced build-instruction
retrieval module that achieves state-of-the-art performance on BUILD-BENCH and
adapts to heterogeneous OSS characteristics. We also provide a detailed analysis of
different compilation design choices and their influence on the overall task, offering
insights to guide future advances. We believe performance on BUILD-BENCH can
faithfully reflect an agent's ability to tackle compilation as a complex software
engineering task, and, as such, our benchmark will spur
innovation with a significant impact on downstream applications in the fields of
software development and software security.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 21758