DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale

ACL ARR 2025 February Submission 3416 Authors

15 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Large Language Models have advanced automated software development; however, correctly inferring dependencies, that is, identifying the internal components and external packages required for a repository to run successfully, remains a challenge. Existing studies highlight that dependency-related issues cause over 40% of observed runtime errors in LLM-generated repositories. To address this, we introduce DI-Bench, a large-scale benchmark and evaluation framework specifically designed to assess LLMs' capability in dependency inference. The benchmark features 600 repositories with testing environments across Python, C#, Rust, and JavaScript. Extensive experiments with textual and execution-based metrics reveal that the current best-performing model achieves only a 48% execution pass rate on Python, indicating significant room for improvement. DI-Bench establishes a new viewpoint for evaluating LLM performance on repositories, paving the way for more robust end-to-end software synthesis (code: https://github.com/DIBench/DIBench).
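To make the dependency-inference task concrete, the following is a minimal illustrative sketch (not the paper's method or pipeline): it scans a Python repository's imports and proposes the external packages a manifest such as pyproject.toml would need to declare; the `KNOWN_ALIASES` mapping and the local-module heuristic are simplifying assumptions for illustration only. A benchmark like DI-Bench would then check whether the declared dependencies let the repository's tests run.

```python
# Illustrative sketch of dependency inference for a Python repository:
# collect top-level imports that are neither standard library nor local,
# and map them to candidate PyPI package names.
import ast
import sys
from pathlib import Path

# Assumed import-name -> PyPI-name aliases; real mappings vary by repository.
KNOWN_ALIASES = {"yaml": "PyYAML", "PIL": "Pillow", "sklearn": "scikit-learn"}
STDLIB = set(sys.stdlib_module_names)  # available in Python 3.10+


def infer_dependencies(repo_root: str) -> set[str]:
    """Return candidate external package names imported under repo_root."""
    py_files = list(Path(repo_root).rglob("*.py"))
    local_modules = {p.stem for p in py_files}  # crude local-module heuristic
    deps: set[str] = set()
    for path in py_files:
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that cannot be parsed
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name.split(".")[0] for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
                names = [node.module.split(".")[0]]
            else:
                continue
            for name in names:
                if name not in STDLIB and name not in local_modules:
                    deps.add(KNOWN_ALIASES.get(name, name))
    return deps


if __name__ == "__main__":
    # Example: print inferred dependencies for the current directory.
    print(sorted(infer_dependencies(".")))
```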
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, NLP datasets, evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: Programming Languages, English Language
Submission Number: 3416