Fine-Tuning LLMs for Automated Feature Engineering

Yoichi Hirose; Kento Uchida; Shinichi Shirakawa

Published: 12 Jul 2024, Last Modified: 16 Aug 2024; AutoML 2024 Workshop; CC BY 4.0

Keywords: Automated Feature Engineering; Large Language Models; Fine Tuning

TL;DR: Fine-tune LLMs to enhance the stability of the code they generate for automated feature engineering

Abstract: Automated machine learning (AutoML) significantly reduces the human effort required to develop machine learning systems. However, automated feature engineering (AutoFE) for tabular datasets, an important topic in AutoML, remains challenging because it requires exploiting contextual knowledge such as dataset descriptions and domain expertise. To address this issue, previous work introduced a framework that uses large language models (LLMs) to generate feature engineering code, taking contextual knowledge as input. When evaluating this framework, we observed that LLMs often generate non-executable code. This paper provides a novel dataset for fine-tuning LLMs to improve the stability of code generation for feature engineering. We created candidate features by iteratively applying predefined operations to the input features of publicly available tabular datasets. We then evaluated the effectiveness of each candidate feature by training machine learning models on the feature-appended datasets. For each dataset, the top features that improved predictive performance were selected and paired with that dataset's metadata. In our experiments, LLMs fine-tuned on the proposed dataset stably generate valid feature engineering code. The results also show that smaller fine-tuned LLMs exhibit better stability than their larger counterparts without fine-tuning.
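The dataset construction described in the abstract (generate candidates, score each one by training a model on the feature-appended data, keep the top performers) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the operation set `OPERATIONS`, the nearest-centroid stand-in model, the fixed train/test split, and the toy data are not taken from the paper, and the sketch runs only a single round of operations rather than applying them iteratively.

```python
from itertools import combinations

# Hypothetical operation set; the paper's predefined operations are not listed in the abstract.
OPERATIONS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}


def generate_candidates(columns):
    """Apply every binary operation to every pair of input columns (one round only)."""
    candidates = {}
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        for op_name, op in OPERATIONS.items():
            expr = f"{op_name}({name_a}, {name_b})"
            candidates[expr] = [op(a, b) for a, b in zip(col_a, col_b)]
    return candidates


def centroid_accuracy(columns, labels, n_train):
    """Stand-in model: nearest-centroid classifier fit on the first n_train rows,
    scored by accuracy on the remaining rows."""
    rows = list(zip(*columns.values()))
    train, test = rows[:n_train], rows[n_train:]
    centroids = {}
    for label in sorted(set(labels[:n_train])):
        members = [r for r, y in zip(train, labels) if y == label]
        centroids[label] = [sum(v) / len(members) for v in zip(*members)]

    def predict(row):
        return min(
            centroids,
            key=lambda c: sum((x - m) ** 2 for x, m in zip(row, centroids[c])),
        )

    hits = sum(predict(r) == y for r, y in zip(test, labels[n_train:]))
    return hits / len(test)


def top_features(columns, labels, n_train, k=1):
    """Rank candidates by accuracy gain over the baseline; keep up to k with positive gain."""
    baseline = centroid_accuracy(columns, labels, n_train)
    gains = {}
    for expr, values in generate_candidates(columns).items():
        augmented = {**columns, expr: values}  # feature-appended dataset
        gains[expr] = centroid_accuracy(augmented, labels, n_train) - baseline
    ranked = sorted(gains.items(), key=lambda item: item[1], reverse=True)
    return [(expr, gain) for expr, gain in ranked[:k] if gain > 0]


# Toy data (invented for this sketch): the label is 1 when x1 and x2 share a sign.
toy_columns = {
    "x1": [1, -1, 1, -1, 2, -2, 2, -2, 1.5, -1.5, 1.5, -1.5],
    "x2": [1, -1, -1, 1, 2, -2, -2, 2, 1.5, -1.5, -1.5, 1.5],
}
toy_labels = [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0]
best = top_features(toy_columns, toy_labels, n_train=8, k=3)
```

On this toy data only the product feature separates the classes, so it is the only candidate retained. In a pipeline like the one the abstract describes, each retained expression would then be paired with the dataset's metadata to form a fine-tuning example; the format of that pairing is not specified in the abstract and is left out here.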

Submission Checklist: Yes

Broader Impact Statement: Yes

Paper Availability And License: Yes

Code Of Conduct: Yes

Submission Number: 28
