Modeling string entries for tabular data prediction: do we need big large language models?

Leo Grinsztajn; Myung Jun Kim; Edouard Oyallon; Gael Varoquaux

Published: 28 Oct 2023, Last Modified: 16 Nov 2023

Venue: TRL @ NeurIPS 2023 Poster

Keywords: tabular data, embeddings, language models

Abstract: Tabular data are often characterized by numerical and categorical features, but these co-exist with features made of text entries, such as names or descriptions. Here, we investigate whether language models can extract information from these text entries. Studying 19 datasets and varying training sizes, we find that using language models to encode text features improves predictions compared with both no encoding and character-level approaches based on substrings. Furthermore, we find that larger, more advanced language models translate into larger improvements.

Submission Number: 50
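
Illustration: the "character-level approaches based on substrings" named in the abstract can be pictured with a minimal sketch like the one below. It is not the authors' implementation. The helper names, the trigram size, and the 16-dimensional hashed vector are all assumptions chosen for illustration. The language-model side of the comparison is not shown, since it would require a pretrained model.

```python
import zlib
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of one string entry."""
    # Pad with spaces so that word boundaries become part of the n-grams.
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def encode_entry(text, n=3, dim=16):
    """Hash n-gram counts into a fixed-size vector (hypothetical helper)."""
    vec = [0.0] * dim
    for gram, count in Counter(char_ngrams(text, n)).items():
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += count
    return vec


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two entries."""
    ga, gb = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


# Entries that share substrings end up close, regardless of meaning.
print(ngram_similarity("data engineer", "data engineering"))
print(ngram_similarity("data engineer", "nurse"))
```

An encoding of this kind captures only surface overlap between strings, so two entries with related meanings but different spellings look unrelated. That limitation is the kind of gap a language-model embedding would be expected to address.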
