Keywords: Benchmark, Code Generation, Deep Learning
TL;DR: DL-Bench is a 520-function benchmark showing LLMs struggle with deep-learning code
Abstract: Deep learning (DL) has revolutionized areas such as computer vision and natural language processing. However, developing DL systems remains challenging due to the complexity of DL workflows. Large Language Models (LLMs), such as GPT, DeepSeek, Claude, Llama, Mistral, and Qwen, have emerged as promising tools to assist in DL code generation, offering potential solutions to these challenges. Despite this, existing benchmarks like DS-1000 are limited: they primarily focus on small DL code snippets related to pre/post-processing tasks and lack comprehensive coverage of the full DL pipeline, including different DL phases and input data types. Similarly, MLE-bench focuses on Machine Learning Engineering (MLE) tasks and broader ML workflows, without leveraging test cases.
To address this, we introduce DL-Bench, a novel benchmark dataset designed for function-level DL code generation. DL-Bench categorizes DL problems along three key aspects: phases, such as pre-processing, model construction, and training; tasks, such as classification, regression, and recommendation; and input data types, such as tabular, image, and text. DL-Bench diverges from related benchmarks, DS-1000 and AICoderEval, across four dimensions: it occupies semantically distinct regions in both the prompt and code embedding spaces, places greater emphasis on DL constructs (a higher DL/ML token ratio), and requires more complex code solutions. State-of-the-art LLMs (e.g., O3-Mini, DeepSeek-V3) achieve a significantly lower average pass@1 score on DL-Bench (28.5\%) than on DS-1000 (53.3\%). This result underscores that DL-Bench poses a more challenging problem set. Our taxonomy of bugs found in LLM-generated DL code highlights the distinct challenges that LLMs face when generating DL code compared to general code.
Furthermore, our analysis reveals substantial performance variations across categories, highlighting the insights DL-Bench offers for improving DL-specific code generation. Our preliminary results show that DL-Bench can enhance LLM performance when used as a categorized training dataset, achieving an average 4.2\% improvement on DS-1000 with guided three-shot learning.
Overall, our empirical results demonstrate the utility of DL-Bench as a comprehensive benchmark while offering insights for future improvements across diverse functional categories.
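Note: the pass@1 scores quoted above are, to our reading, consistent with the standard unbiased pass@k estimator of Chen et al. (2021); the sketch below restates that estimator under the assumption that DL-Bench follows this common definition, where $n$ generations are sampled per problem and $c$ of them pass all test cases.
% Standard pass@k estimator (Chen et al., 2021); assumed here for illustration only.
% n = samples generated per problem, c = samples passing all test cases.
\[
  \text{pass@}k = \mathbb{E}_{\text{problems}}\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right],
  \qquad
  \text{pass@}1 = \mathbb{E}_{\text{problems}}\left[ \frac{c}{n} \right].
\]
For $k=1$ the estimator reduces to the average fraction of sampled generations that pass the tests, which is how single-sample scores such as the 28.5\% and 53.3\% figures above are typically computed.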
Primary Area: datasets and benchmarks
Submission Number: 20806