Keywords: Representation Learning, Multimodal Reasoning, Symbolic Program Synthesis, Exploratory Data Analysis
TL;DR: We introduce PIFE, an AutoFE framework that leverages iterative EDA with multimodal language models and symbolic synthesis to generate interpretable, high-quality features that improve tabular prediction.
Abstract: Despite significant advances in Automated Machine Learning (AutoML), one of its persistent blind spots remains the automation of data-centric tasks such as exploratory data analysis (EDA), contextual insight extraction, and feature engineering. These steps, often more critical than model selection itself, are still largely manual, domain-specific, and reliant on human intuition. Existing automated feature engineering (AutoFE) techniques rely either on rigid transformation sets or on complex optimization strategies that struggle with interpretability and fail to leverage the rich visual cues that guide human decision-making. In this work, we introduce PIFE (Progressive Insight driven Feature Engineering via Multimodal Reasoning), a novel AutoFE framework that employs multimodal language models as collaborative agents in an iterative pipeline. PIFE systematically performs automated EDA, generating statistical summaries and visualizations that are jointly interpreted through text–vision reasoning. These multimodal insights inform the synthesis of candidate transformations, represented as symbolic programs in executable Python code to ensure interpretability and reproducibility. By coupling iterative insight extraction with validation-driven refinement, PIFE produces high-quality, interpretable features that consistently enhance the performance of diverse predictive models, outperforming existing AutoFE baselines. Extensive experiments across a range of tabular datasets demonstrate the effectiveness and adaptability of our approach, paving the way for a new class of human-aligned, insight-aware AutoFE systems.
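To make the idea of "symbolic programs" combined with "validation-driven refinement" concrete, here is a minimal sketch. It is not PIFE's implementation, and it makes several assumptions: candidate features are written as Python expressions over column names, the multimodal model is replaced by a hard-coded candidate list, and the validation score is the absolute Pearson correlation with the target on a tiny toy table. The function names (`apply_program`, `validation_score`, `select_features`) and the toy columns are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if degenerate)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def apply_program(program, rows):
    """Evaluate a candidate feature, given as a Python expression, on every row.

    Assumption: programs are plain expressions over column names. Emptying
    __builtins__ limits what an expression can call, but it is NOT a security
    sandbox; untrusted model output would need real isolation.
    """
    code = compile(program, "<feature>", "eval")
    return [eval(code, {"__builtins__": {}, "math": math}, dict(row)) for row in rows]


def validation_score(values, targets):
    """Stand-in validation metric (assumption): absolute correlation with the target."""
    return abs(pearson(values, targets))


def select_features(candidates, val_rows, target, baseline):
    """Greedy refinement: keep a candidate only if it beats the best score so far."""
    targets = [row[target] for row in val_rows]
    accepted, best = [], baseline
    for program in candidates:
        try:
            values = apply_program(program, val_rows)
        except Exception:
            continue  # programs that fail to execute are discarded
        score = validation_score(values, targets)
        if score > best:
            accepted.append((program, score))
            best = score
    return accepted


# Toy validation table in which the target is exactly width * height.
pairs = [(1, 5), (2, 3), (3, 8), (4, 1), (5, 6), (6, 2)]
rows = [{"width": w, "height": h, "price": w * h} for w, h in pairs]
targets = [row["price"] for row in rows]

# Baseline: the best single raw column under the same metric.
baseline = max(
    validation_score([row[col] for row in rows], targets)
    for col in ("width", "height")
)

# Hard-coded candidates standing in for model-proposed programs;
# the last one references a column that does not exist.
candidates = ["width + height", "width * height", "width / missing"]
accepted = select_features(candidates, rows, "price", baseline)
```

Because each accepted feature is a short, readable expression, it can be inspected and re-executed directly. A fuller system would replace the correlation proxy with the cross-validated performance of a downstream model.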
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 20500