Training and Cross-Validating Machine Learning Pipelines with Limited Memory

Martin Hirzel; Kiran Kate; Louis Mandel; Avraham Shinnar

Published: 30 Apr 2024, Last Modified: 05 Sept 2024

Venue: AutoML 2024

License: CC BY 4.0

Keywords: cross-validation, machine learning, monoids, pipelines, batching, limited memory

TL;DR: This paper enables training and cross-validating pipelines on large data in batches by using monoids.

Abstract:

While automated machine learning (AutoML) can save human labor in finding well-performing pipelines, it often suffers from two problems: overfitting and using excessive resources. Unfortunately, the solutions are often at odds: cross-validation helps reduce overfitting at the expense of more resources; conversely, preprocessing on a separate compute cluster and then cross-validating only the final predictor saves resources at the expense of more overfitting. This paper shows how to train and cross-validate entire pipelines on a single moderate machine with limited memory by using monoids, whose associative combine operation provides a flexible way to handle large data one batch at a time. To facilitate AutoML, our approach is designed to support the common sklearn APIs used by many AutoML systems for pipelines, training, cross-validation, and several operators. Abstracted behind those APIs, our approach uses task graphs to extend the benefits of monoids from operators to pipelines, and provides a multi-backend implementation. Overall, our approach lets users train and cross-validate pipelines on simple and inexpensive compute infrastructure.
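To illustrate the monoid idea mentioned in the abstract, here is a minimal sketch, not the paper's implementation. It assumes a single numeric column and a standard-scaler-like operator whose fitted state is a count, a sum, and a sum of squares. The names `ScalerState`, `from_batch`, and `combine` are hypothetical and chosen only for this example. Because `combine` is associative and has an identity element, partial states from individual batches can be merged in any grouping, so only one batch needs to be in memory at a time.

```python
from dataclasses import dataclass
from functools import reduce
import math


@dataclass(frozen=True)
class ScalerState:
    """Hypothetical partial fit state for one numeric column."""
    count: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    @staticmethod
    def from_batch(values):
        # Lift one batch of raw values into a partial state.
        return ScalerState(len(values), sum(values), sum(v * v for v in values))

    def combine(self, other):
        # Associative merge; ScalerState() acts as the identity element.
        return ScalerState(
            self.count + other.count,
            self.total + other.total,
            self.total_sq + other.total_sq,
        )

    def mean(self):
        return self.total / self.count

    def std(self):
        # Population standard deviation derived from the merged sums.
        m = self.mean()
        return math.sqrt(max(self.total_sq / self.count - m * m, 0.0))


# Three batches that are never held in memory together in a real setting.
batches = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
state = reduce(ScalerState.combine, map(ScalerState.from_batch, batches), ScalerState())
```

The same merged state could equally be built by combining the batch states as a tree or in parallel, which is what makes the associativity property useful for batching and for reusing partial states across cross-validation folds.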

Submission Checklist: Yes

Broader Impact Statement: Yes

Paper Availability And License: Yes

Code Of Conduct: Yes

Code And Dataset Supplement: zip

Submission Number: 24