Surprisingly Simple: Large Language Models are Zero-Shot Feature Extractors for Tabular and Text Data

27 Sept 2024 (modified: 02 Dec 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: Large Language Model, Tabular Data Prediction, Multimodal Learning
TL;DR: Large Language Models are Zero-Shot Feature Extractors for Tabular and Text Data
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks, yet their application to tabular data prediction remains relatively underexplored. This is partly because recent LLMs are autoregressive models that generate text outputs: converting tabular data to text, and vice versa, is not straightforward, which makes direct application of LLMs to complex tabular prediction difficult. Although previous works have fine-tuned pre-trained embedding models such as BERT and its variants on tabular tasks, the potential of autoregressive LLMs for tabular prediction has been explored only on a limited scale and with simpler datasets. In this paper, we propose Zero-shot Encoding for Tabular data with LLMs (ZET-LLM), a surprisingly simple yet effective approach that leverages pre-trained LLMs as zero-shot feature extractors for tabular prediction tasks. To adapt autoregressive LLMs for this purpose, we replace autoregressive masking with bidirectional attention so that they behave as feature embedding models. To encode high-dimensional, complex tabular data within LLMs' limited context lengths, we introduce feature-wise serialization, where each feature is represented as a single token and the resulting tokens are combined into a unified sample representation. We also apply missing-value masking to handle missing data, a common issue in complex tabular datasets. We demonstrate that LLMs can serve as powerful zero-shot feature extractors without fine-tuning, extensive data pre-processing, or task-specific instructions. Our method enables LLMs to process structured tabular data and unstructured text data simultaneously, offering a unique advantage over traditional models. Extensive experiments on complex tabular datasets show that our approach outperforms state-of-the-art methods on binary classification, multi-class classification, and regression tasks.
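To make the abstract's pipeline concrete, the following is a minimal sketch (not the authors' released code) of zero-shot feature extraction in the spirit of ZET-LLM, using the Hugging Face transformers library. The backbone name, mean pooling, and the missing-value masking scheme here are illustrative assumptions, and the paper's replacement of autoregressive masking with bidirectional attention is model-specific and not reproduced in this sketch.

```python
# Minimal sketch of feature-wise serialization with a pre-trained LLM backbone.
# Assumptions (not from the paper): "gpt2" as a placeholder backbone, mean
# pooling over token states, and simple zero-masking of missing features.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "gpt2"  # placeholder; the paper targets larger autoregressive LLMs

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed_sample(row: dict) -> torch.Tensor:
    """Serialize each feature as 'name: value', encode it into one pooled
    embedding ("single token" per feature), mask missing values, and combine
    the feature tokens into a unified sample representation."""
    feature_vecs, keep = [], []
    for name, value in row.items():
        missing = value is None
        text = f"{name}: {'' if missing else value}"
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
        feature_vecs.append(hidden.mean(dim=1).squeeze(0))  # one vector per feature
        keep.append(0.0 if missing else 1.0)
    feats = torch.stack(feature_vecs)            # (num_features, dim)
    mask = torch.tensor(keep).unsqueeze(-1)      # missing-value mask
    # Masked average over feature tokens -> single sample embedding.
    return (feats * mask).sum(dim=0) / mask.sum().clamp(min=1.0)

# Usage: the frozen-LLM embedding feeds any lightweight downstream predictor
# (e.g., logistic regression or gradient boosting) for classification/regression.
sample = {"age": 52, "occupation": "teacher", "income": None,
          "notes": "long-time customer, prefers email contact"}
z = embed_sample(sample)
print(z.shape)  # e.g. torch.Size([768]) for a GPT-2-sized backbone
```

Because the mixed tabular/text sample above is handled with the same serialization, this also illustrates the abstract's claim that structured and unstructured fields can be encoded jointly.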
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8821