Abstract: Recent advances in large language models (LLMs) have yielded promising results in generating executable code from natural language. However, existing benchmarks typically rely on synthetic prompts or constrained domains, limiting insight into LLM performance on realistic machine learning (ML) workflows. We introduce GenMLBench, a domain-diverse benchmark for evaluating language-to-code generation in the context of ML pipeline creation. GenMLBench extends the Code4ML corpus with natural language task descriptions and structured metadata derived from 50 Kaggle competitions spanning domains including finance, healthcare, and computer vision. We evaluate LLMs using an open-source code-generation framework, applying standardized execution constraints and metric validation. Our analysis reveals key failure modes, such as hallucinations and data leakage, and highlights how success rates vary across data modalities and task types. GenMLBench provides a rigorous testbed for future research on robust, agent-based ML code generation.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, NLP datasets, evaluation methodologies, reproducibility
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 7069