DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale

ACL ARR 2025 February Submission 3416 Authors

15 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Large Language Models have advanced automated software development; however, correctly inferring dependencies, that is, identifying the internal components and external packages required for a repository to run successfully, remains a challenge. Existing studies highlight that dependency-related issues cause over 40% of observed runtime errors in LLM-generated repositories. To address this, we introduce DI-Bench, a large-scale benchmark and evaluation framework specifically designed to assess LLMs' capability in dependency inference. The benchmark features 600 repositories with testing environments across Python, C#, Rust, and JavaScript. Extensive experiments with textual and execution-based metrics reveal that the current best-performing model achieves only a 48% execution pass rate on Python, indicating significant room for improvement. DI-Bench establishes a new viewpoint for evaluating LLM performance on repositories, paving the way for more robust end-to-end software synthesis (code: https://github.com/DIBench/DIBench).
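To make the dependency-inference task concrete, the following is a minimal illustrative sketch (not the paper's method or pipeline): it scans a Python repository's imports and proposes the external packages a manifest such as pyproject.toml would need to declare; the `KNOWN_ALIASES` mapping and the local-module heuristic are simplifying assumptions for illustration only. A benchmark like DI-Bench would then check whether the declared dependencies let the repository's tests run.

```python
# Illustrative sketch of dependency inference for a Python repository:
# collect top-level imports that are neither standard library nor local,
# and map them to candidate PyPI package names.
import ast
import sys
from pathlib import Path

# Assumed import-name -> PyPI-name aliases; real mappings vary by repository.
KNOWN_ALIASES = {"yaml": "PyYAML", "PIL": "Pillow", "sklearn": "scikit-learn"}
STDLIB = set(sys.stdlib_module_names)  # available in Python 3.10+


def infer_dependencies(repo_root: str) -> set[str]:
    """Return candidate external package names imported under repo_root."""
    py_files = list(Path(repo_root).rglob("*.py"))
    local_modules = {p.stem for p in py_files}  # crude local-module heuristic
    deps: set[str] = set()
    for path in py_files:
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that cannot be parsed
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name.split(".")[0] for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
                names = [node.module.split(".")[0]]
            else:
                continue
            for name in names:
                if name not in STDLIB and name not in local_modules:
                    deps.add(KNOWN_ALIASES.get(name, name))
    return deps


if __name__ == "__main__":
    # Example: print inferred dependencies for the current directory.
    print(sorted(infer_dependencies(".")))
```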
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, NLP datasets, evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: Programming Languages, English Language
Submission Number: 3416