ML²B: Multi-Lingual ML Benchmark For AutoML

19 Sept 2025 (modified: 06 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Multilingual machine learning, Large language models, Cross-lingual representation learning, Code generation, Machine learning workflows, Benchmark dataset
Abstract: Large language models (LLMs) have recently demonstrated strong capabilities in generating machine learning (ML) code, enabling end-to-end pipeline construction from natural language instructions. However, existing benchmarks for ML code generation are restricted mainly to English, overlooking the global and multilingual nature of ML research and practice. To address this gap, we present ML²B, the first benchmark for evaluating multilingual ML code generation. ML²B comprises 35 Kaggle competitions translated into 13 natural languages, covering tabular, text, and image data types, with structured metadata and validated, human-reviewed translations. For evaluation, we employ AIDE, an automated framework for end-to-end assessment of data science pipelines, and provide insights into cross-lingual model performance. Overall, the results indicate that cross-lingual performance remains unstable, even for languages with substantial training data. The benchmark, evaluation framework, and comprehensive results are available through our GitHub repository to facilitate future research in multilingual ML code generation: https://github.com/AnonimusCoders/ML²B.
Primary Area: datasets and benchmarks
Submission Number: 20002