Automated Benchmark Generation for Repository-Level Code Input Synthesis via Coverage-Guided Fuzzing
Keywords: Evaluation, Software Testing, Fuzzing
Abstract: Evaluating the capabilities of large language models (LLMs) on practical, repository-level testing tasks is crucial for their effective application in software engineering. Many existing benchmarks rely on human-authored data such as issues, patches, and unit tests, which can limit scalability and risk solution leakage from training corpora. We introduce TTG-GEN, an automated framework for generating targeted test-input generation (TTG) problems from real-world codebases, in which an LLM is tasked with synthesizing an input byte sequence that drives execution to a designated code location. These problems are representative of tasks software engineers perform during debugging and are designed to probe an LLM's understanding of complex control and data flow in real-world code. TTG-GEN leverages coverage-guided fuzzing (CGF) to identify target locations that are reachable yet non-trivial, requiring structure-aware inputs to cover. Because it generates TTG problems automatically, TTG-GEN offers a practical, scalable, and continuously updatable evaluation framework with a low risk of direct solution leakage, well suited to assessing repository-level code comprehension. Using TTG-GEN, we construct TTG-BENCH-LITE, a benchmark of 500 such problems derived from 16 foundational C/C++ software projects. Our evaluation across retrieval-based and agent-based settings shows that even the most capable LLMs solve only 15% of these problems on their first attempt. Comprehending and manipulating program behavior at the repository level thus remains a significant hurdle for current models, highlighting a substantial gap between their abilities and the proficiency required for complex software engineering tasks.
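To make the task concrete, below is a minimal, hypothetical C sketch of the kind of designated location a TTG problem might target; the function, input format, and names are illustrative assumptions, not taken from the benchmark. Reaching the marked line requires a byte sequence that satisfies a chain of structural checks (magic bytes, a length field consistent with the payload, and a specific record type), which is exactly the sort of location that unstructured inputs rarely cover.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record parser. The TTG task would be: synthesize a byte
 * sequence `data` that drives execution to the line marked TARGET. */
int parse_record(const uint8_t *data, size_t size) {
    if (size < 8)
        return -1;                      /* too short for the 8-byte header */
    if (memcmp(data, "RCRD", 4) != 0)
        return -1;                      /* magic bytes must match */

    uint16_t payload_len = (uint16_t)(data[4] | (data[5] << 8));
    uint8_t  record_type = data[6];     /* byte 7 is reserved/unused */

    if (size != 8u + payload_len)
        return -1;                      /* length field must be consistent */
    if (record_type != 0x07)
        return 0;                       /* only record type 0x07 is interesting */

    /* TARGET: a TTG problem would designate this location and ask the
     * model for concrete input bytes that reach it. */
    return 1;
}
```

For this toy example, one valid solution is the 8-byte sequence 52 43 52 44 00 00 07 00 ("RCRD", zero-length payload, type 0x07), which illustrates why the abstract describes the required inputs as structure-aware.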
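The abstract states that targets are identified with coverage-guided fuzzing. One plausible setup, offered here as an assumption rather than a description of TTG-GEN's actual pipeline, is a standard libFuzzer entry point over a harness like the parser above: locations the fuzzer covers only rarely, or only late in the campaign, become candidate targets.

```c
#include <stddef.h>
#include <stdint.h>

int parse_record(const uint8_t *data, size_t size);  /* from the sketch above */

/* Standard libFuzzer entry point (build with: clang -fsanitize=fuzzer,address).
 * The coverage-guided fuzzer mutates `data` to maximize edge coverage; edges
 * that are reachable but seldom covered by the resulting corpus are natural
 * candidates for non-trivial TTG targets. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    parse_record(data, size);
    return 0;
}
```

Targets surfaced this way match the abstract's criterion: they are known to be reachable (the fuzzer covered them at least once) yet demonstrably hard to hit without understanding the input structure.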
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 18454