Automated Benchmark Generation for Repository-Level Code Input Synthesis via Coverage-Guided Fuzzing
Keywords: Evaluation, Software Testing, Fuzzing
Abstract: Evaluating the capabilities of large language models (LLMs) on practical, repository-level testing tasks is crucial for their effective application in software engineering. Many existing benchmarks rely on human-authored data such as issues, patches, and unit tests, which can limit scalability and risk solution leakage from training corpora. We introduce TTG-GEN, an automated framework for generating targeted test-input generation (TTG) problems from real-world codebases, in which an LLM is tasked with synthesizing an input byte sequence that drives execution to a designated code location. These problems are representative of tasks software engineers perform during debugging and are designed to probe an LLM's understanding of complex control and data flow in real-world code. TTG-GEN leverages coverage-guided fuzzing (CGF) to identify target locations that are reachable yet non-trivial, requiring structure-aware inputs to cover. Because it generates TTG problems automatically, TTG-GEN offers a practical, scalable, and continuously updatable evaluation framework with a low risk of direct solution leakage, well suited to assessing repository-level code comprehension. Using TTG-GEN, we construct TTG-BENCH-LITE, a benchmark of 500 such problems derived from 16 foundational C/C++ software projects. Our evaluation across retrieval-based and agent-based settings shows that even the most capable LLMs solve only 15% of these problems on their first attempt. Comprehending and manipulating program behavior at the repository level thus remains a significant hurdle for current models, highlighting a substantial gap between their abilities and the proficiency required for complex software engineering tasks.
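To make the task concrete, below is a minimal, hypothetical C sketch of the kind of designated location a TTG problem might target; the function, input format, and names are illustrative assumptions, not taken from the benchmark. Reaching the marked line requires a byte sequence that satisfies a chain of structural checks (magic bytes, a length field consistent with the payload, and a specific record type), which is exactly the sort of location that unstructured inputs rarely cover.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record parser. The TTG task would be: synthesize a byte
 * sequence `data` that drives execution to the line marked TARGET. */
int parse_record(const uint8_t *data, size_t size) {
    if (size < 8)
        return -1;                      /* too short for the 8-byte header */
    if (memcmp(data, "RCRD", 4) != 0)
        return -1;                      /* magic bytes must match */

    uint16_t payload_len = (uint16_t)(data[4] | (data[5] << 8));
    uint8_t  record_type = data[6];     /* byte 7 is reserved/unused */

    if (size != 8u + payload_len)
        return -1;                      /* length field must be consistent */
    if (record_type != 0x07)
        return 0;                       /* only record type 0x07 is interesting */

    /* TARGET: a TTG problem would designate this location and ask the
     * model for concrete input bytes that reach it. */
    return 1;
}
```

For this toy example, one valid solution is the 8-byte sequence 52 43 52 44 00 00 07 00 ("RCRD", zero-length payload, type 0x07), which illustrates why the abstract describes the required inputs as structure-aware.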
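The abstract states that targets are identified with coverage-guided fuzzing. One plausible setup, offered here as an assumption rather than a description of TTG-GEN's actual pipeline, is a standard libFuzzer entry point over a harness like the parser above: locations the fuzzer covers only rarely, or only late in the campaign, become candidate targets.

```c
#include <stddef.h>
#include <stdint.h>

int parse_record(const uint8_t *data, size_t size);  /* from the sketch above */

/* Standard libFuzzer entry point (build with: clang -fsanitize=fuzzer,address).
 * The coverage-guided fuzzer mutates `data` to maximize edge coverage; edges
 * that are reachable but seldom covered by the resulting corpus are natural
 * candidates for non-trivial TTG targets. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    parse_record(data, size);
    return 0;
}
```

Targets surfaced this way match the abstract's criterion: they are known to be reachable (the fuzzer covered them at least once) yet demonstrably hard to hit without understanding the input structure.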
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 18454