JunoBench: A Benchmark Dataset of Crashes in Python Machine Learning Jupyter Notebooks

Yiran Wang; José Antonio Hernández López; Ulf Nilsson; Dániel Varró

Published: 14 May 2026, Last Modified: 14 May 2026

Venue: AIWare 2026 Benchmark and Dataset

License: CC BY 4.0

Keywords: benchmark dataset, software bugs, machine learning, Jupyter notebooks

Abstract: Jupyter notebooks are widely used for machine learning (ML) prototyping and experimentation, yet debugging support for notebook-based ML development remains limited, partly due to the lack of realistic and executable bug benchmarks. We introduce JunoBench, to our knowledge the first executable benchmark of real-world crashes in Python-based ML notebooks. JunoBench contains 111 curated and reproducible crashes from public Kaggle notebooks, each paired with a verified fix. The benchmark covers widely used ML libraries (e.g., TensorFlow/Keras, PyTorch, and Scikit-learn) as well as notebook-specific failures such as out-of-order execution. To ensure reproducibility and ease of evaluation, JunoBench provides a unified execution environment that reliably reproduces all crashes. In addition, each crash is accompanied by human-validated annotations, including library cause, crash type, root cause, ML pipeline stage, and natural-language diagnostic summaries. By combining realistic crashes, verified fixes, structured labels, and reproducible execution, JunoBench enables systematic evaluation of crash detection, diagnosis, and automated repair techniques for notebook-based ML development.
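Illustration: the abstract names out-of-order execution as a notebook-specific failure and lists the annotation categories attached to each crash. The sketch below is a minimal, hypothetical illustration of both ideas and is not taken from JunoBench. The `CrashRecord` class, its field names, the `run_cells` helper, the toy cells, and the label values are all assumptions made for this example. The benchmark's actual schema and tooling may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrashRecord:
    # Hypothetical field names that mirror the annotation categories
    # listed in the abstract; not the benchmark's actual schema.
    library_cause: str
    crash_type: Optional[str]
    root_cause: str
    ml_pipeline_stage: str
    diagnosis: str


def run_cells(cells, order):
    """Run notebook-style cells in one shared namespace, in the given order.

    Returns the exception class name of the first crash, or None if
    every cell ran without raising.
    """
    namespace = {}
    for index in order:
        try:
            exec(cells[index], namespace)
        except Exception as exc:
            return type(exc).__name__
    return None


# Two toy cells: the second depends on a variable defined by the first.
cells = [
    "data = [1.0, 2.0, 3.0]",
    "mean = sum(data) / len(data)",
]

in_order = run_cells(cells, [0, 1])      # cells run top to bottom
out_of_order = run_cells(cells, [1, 0])  # cell 1 runs before `data` exists

# Illustrative label values only; they do not come from the dataset.
record = CrashRecord(
    library_cause="none (notebook-specific)",
    crash_type=out_of_order,
    root_cause="out-of-order execution",
    ml_pipeline_stage="data preparation",
    diagnosis="The cell reads `data` before the defining cell has run.",
)

print(in_order, out_of_order)
```

Running the cells top to bottom succeeds, while running the second cell first raises an exception because the shared namespace does not yet contain `data`. A structured record of this kind is one plausible way to pair a reproducible crash with its labels for evaluating detection or repair tools.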

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 6
