IngesTables: Scalable and Efficient Training of LLM-Enabled Tabular Foundation Models
Keywords: tabular, foundation model, structured data, LLM
TL;DR: We propose a scalable and efficient way to train Tabular Foundation Models using LLM Embeddings.
Abstract: There is a massive amount of tabular data that can be taken advantage of via `foundation models' to improve prediction performance for downstream tabular prediction tasks. However, numerous challenges constitute bottlenecks in building tabular foundation models, including learning semantic relevance between tables and features, mismatched schemes, arbitrarily high cardinality for categorical values, and scalability to many tables, rows and features. We propose IngesTables, a novel canonical tabular foundation model building framework, designed to address the aforementioned challenges. IngesTables employs LLMs to encode representations of table/feature semantics and the relationships, that are then modeled via an attention-based tabular architecture. Unlike other LLM-based approaches, IngesTables is much cheaper to train and faster to run inference, because of how LLM-generated embeddings are defined and cached. We show that IngesTables demonstrates significant improvements over commonly-used models like XGBoost on clinical trial datasets in standard supervised learning settings, and is competitive with tabular prediction models that are specialized for clinical trial datasets without incurring LLM-level cost and latency.
Submission Number: 26