Abstract: Recent advances in large language models (LLMs) have yielded promising results in generating executable code from natural language. However, existing benchmarks typically rely on synthetic prompts or constrained domains, limiting insight into LLM performance on realistic machine learning (ML) workflows. We introduce GenMLBench, a domain-diverse benchmark for evaluating language-to-code generation in the context of ML pipeline creation. GenMLBench extends the Code4ML corpus with natural language task descriptions and structured metadata derived from 50 Kaggle competitions spanning domains including finance, healthcare, and computer vision. We evaluate LLMs using an open-source code-generation framework, applying standardized execution constraints and metric validation. Our analysis reveals key failure modes, such as hallucinations and data leakage, and highlights how success rates vary across data modalities and task types. GenMLBench provides a rigorous testbed for future research on robust, agent-based ML code generation.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, NLP datasets, evaluation methodologies, reproducibility
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 7069