Keywords: Tabular data learning, large language models, generative tabular learning, scaling laws
TL;DR: Scaling generative tabular learning of large language models yields significantly improved performance on universal tabular learning
Abstract: Developing predictive models for tabular data is essential across many industrial applications. The primary challenge in these tasks lies in handling heterogeneous data schemas and diverse prediction targets. Recently, generative tabular learning (GTL) was proposed to leverage the instruction-following paradigm of large language models (LLMs), enabling universal tabular learning across varied datasets. This approach supports effective prompt-based transfer to downstream tasks without supervised tuning. However, the full potential of GTL-enhanced LLMs remains largely unexplored due to limitations in dataset size, sequence length, and model architecture, which lead to notable performance gaps compared with traditional tuning-based tabular models as the number of training examples grows. In this study, we aim to unlock the full potential of GTL from a scaling perspective. We expand the pre-training datasets from 340 to 972, extend the sequence length from 4,096 to 16,384 tokens, and experiment with different base LLMs. Our findings reveal that scaling the datasets and prediction tasks generally enhances generalization, although regression tasks tend to saturate quickly. Increasing the number of in-context samples consistently improves performance, especially during inference. Our optimized LLMs demonstrate significant improvements, closing the gap with, and even surpassing, highly optimized tuning-based models when more training samples are available.
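For readers unfamiliar with prompt-based tabular transfer, the following minimal Python sketch illustrates the general idea of serializing tabular rows into an instruction prompt with labeled in-context examples. The prompt template, feature names, and labels are hypothetical and are not the paper's actual GTL serialization format.

```python
# Illustrative sketch only: builds an instruction-style prompt from tabular rows
# with in-context examples, in the spirit of the prompt-based transfer described
# in the abstract. All feature names, labels, and the template are hypothetical.

from typing import Dict, List


def serialize_row(row: Dict[str, float]) -> str:
    """Render one tabular row as a comma-separated 'feature is value' string."""
    return ", ".join(f"{name} is {value}" for name, value in row.items())


def build_prompt(task: str,
                 in_context: List[Dict],
                 query_row: Dict[str, float]) -> str:
    """Assemble a prompt: task description, labeled in-context examples,
    and the unlabeled query row to be predicted."""
    lines = [f"Task: {task}", "Examples:"]
    for example in in_context:
        lines.append(f"- {serialize_row(example['features'])} -> {example['label']}")
    lines.append(f"Predict the label for: {serialize_row(query_row)}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy binary-classification rows (hypothetical data).
    examples = [
        {"features": {"age": 42, "income": 58000}, "label": "yes"},
        {"features": {"age": 23, "income": 21000}, "label": "no"},
    ]
    prompt = build_prompt(
        task="Predict whether the customer subscribes (yes/no).",
        in_context=examples,
        query_row={"age": 35, "income": 47000},
    )
    print(prompt)  # This text would then be passed to an instruction-tuned LLM.
```

In this setup, adding more in-context examples lengthens the prompt, which is why the extended 16,384-token context studied in the paper matters for scaling the number of in-context samples.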
Submission Number: 65