ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: Large language models have shown promising performance on code generation benchmarks. However, a considerable divide exists between these benchmark achievements and practical applicability, primarily because real-world programming relies on pre-existing libraries. Rather than evaluating LLMs on coding from scratch, this work proposes a new evaluation setup in which LLMs use open-source libraries to complete machine learning tasks. To this end, we propose ML-Bench, a benchmark that evaluates how well these models use open-source libraries for machine learning tasks. It includes 10,100 samples spanning 169 tasks from 18 GitHub repositories, requiring models to understand complex documents and code structures. Notably, while GPT-4 exhibits remarkable improvement over other LLMs, it completes only 33.82% of the tasks. Furthermore, we present an agent baseline, ML-Agent, which navigates codebases and generates executable code effectively.
Paper Type: long
Research Area: NLP Applications
Contribution Types: Data resources
Languages Studied: English