Keywords: Automated Feature Engineering; Large Language Models; Fine-Tuning
TL;DR: Fine-tuning LLMs to improve the stability of LLM-generated code for automated feature engineering
Abstract: Automated machine learning (AutoML) significantly reduces human effort in developing machine learning systems. However, automated feature engineering (AutoFE) for tabular datasets, an important topic in AutoML, remains challenging because it requires exploiting contextual knowledge, such as dataset descriptions and domain expertise. To address this issue, previous work introduced a framework that uses large language models (LLMs) to generate code for feature engineering, taking contextual knowledge as input. Upon evaluating this framework, we observed that LLMs often generate non-executable code. This paper provides a novel dataset for fine-tuning LLMs to improve the stability of code generation for feature engineering. We created candidate features by iteratively applying predefined operations to input features in publicly available tabular datasets. We then evaluated the effectiveness of each candidate feature by training machine learning models on the feature-appended datasets. The top features that improved predictive performance for each dataset were selected and paired with metadata from the corresponding dataset. In our experiments, we demonstrate that LLMs fine-tuned on the proposed dataset stably generate valid code for feature engineering. The results also show that smaller fine-tuned LLMs exhibit better stability than their larger counterparts without fine-tuning.
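The abstract outlines a candidate-feature generation and evaluation loop used to build the fine-tuning dataset. The following is a minimal sketch of how such a loop might look; the operation set, model choice, scoring, and function names are illustrative assumptions, not the authors' exact procedure.

```python
# Hypothetical sketch of candidate-feature generation and evaluation.
# Operations, model, and scoring are assumptions for illustration only.
import itertools
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Predefined unary and binary operations applied to numeric input features.
UNARY_OPS = {"log": lambda s: np.log1p(s.clip(lower=0)), "square": lambda s: s ** 2}
BINARY_OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

def generate_candidates(X: pd.DataFrame) -> dict:
    """Apply predefined operations to input features to build candidate features."""
    candidates = {}
    numeric_cols = X.select_dtypes(include="number").columns
    for col in numeric_cols:
        for name, op in UNARY_OPS.items():
            candidates[f"{name}({col})"] = op(X[col])
    for a, b in itertools.combinations(numeric_cols, 2):
        for name, op in BINARY_OPS.items():
            candidates[f"{name}({a},{b})"] = op(X[a], X[b])
    return candidates

def select_top_features(X: pd.DataFrame, y: pd.Series, top_k: int = 5) -> list:
    """Score each candidate by its cross-validated gain over the baseline and keep the top k."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    baseline = cross_val_score(model, X, y, cv=5).mean()
    gains = {}
    for name, values in generate_candidates(X).items():
        X_aug = X.assign(**{name: values})  # feature-appended dataset
        gains[name] = cross_val_score(model, X_aug, y, cv=5).mean() - baseline
    return sorted(gains, key=gains.get, reverse=True)[:top_k]
```

The selected features would then be paired with the dataset's metadata (e.g., its description and column names) to form fine-tuning examples, as described in the abstract.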
Submission Checklist: Yes
Broader Impact Statement: Yes
Paper Availability And License: Yes
Code Of Conduct: Yes
Optional Meta-Data For Green-AutoML: All questions below on environmental impact are optional.
Submission Number: 28