Semi-supervised Tabular Classification via In-context Learning of Large Language Models
Keywords: Tabular representation learning, Semi-supervised learning, In-context learning, Large language models
TL;DR: We propose a simple yet powerful semi-supervised tabular learning framework that exploits the in-context learning capabilities of large language models to effectively extract transferable knowledge from unlabeled tables.
Abstract: Learning from limited labeled tabular samples is an important problem for industrial machine learning applications, as acquiring annotations for tabular data is often too costly. Meanwhile, recent progress in natural language processing has shown that this issue can be circumvented by using pre-trained large language models (LLMs). Motivated by this, we ask whether LLMs can also help with limited labeled data in the tabular domain. We answer in the affirmative by proposing a novel semi-supervised tabular learning framework, coined Self-generated PROmpts from Unlabeled Tables (SPROUT), which utilizes unlabeled data in conjunction with LLMs. Our main idea is to exploit the in-context learning capabilities of LLMs to effectively extract transferable knowledge from unlabeled tabular samples. Specifically, SPROUT generates in-context prompts from unlabeled tables by identifying a column feature that exhibits a strong correlation with the actual target label, thereby creating examples that pertain to the true target task. In addition, we demonstrate how a language prior can facilitate knowledge transfer from heterogeneous data sources, enhancing performance on target datasets and mitigating the challenges posed by varying input formats. Experimental results show that SPROUT yields substantial performance improvements over previous methods across various tabular benchmarks.
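To make the prompt-generation idea in the abstract concrete, here is a minimal, hypothetical sketch. It is not the authors' implementation: the abstract does not say how the correlated column is found or how rows are turned into text, so everything below is an assumption. The sketch assumes all columns are categorical with few distinct values, scores each column with a simple purity measure on a handful of labeled rows, and then uses the best-scoring column as a stand-in answer when serializing unlabeled rows into in-context demonstrations. The names `association`, `pick_proxy_column`, `serialize`, and `build_demonstrations`, the prompt wording, and the toy data are all illustrative.

```python
from collections import Counter, defaultdict


def association(rows, labels, col):
    """Purity score (assumed metric): fraction of labeled rows whose label
    equals the majority label among rows sharing the same value in `col`."""
    by_value = defaultdict(Counter)
    for row, y in zip(rows, labels):
        by_value[row[col]][y] += 1
    hits = sum(counts.most_common(1)[0][1] for counts in by_value.values())
    return hits / len(rows)


def pick_proxy_column(rows, labels):
    """Return the column whose values best separate the labels.
    Assumes categorical columns; ties go to the first column seen."""
    return max(rows[0].keys(), key=lambda c: association(rows, labels, c))


def serialize(row, skip):
    """Turn a row into text, leaving out the column used as the answer."""
    return ", ".join(f"{k} is {v}" for k, v in row.items() if k != skip)


def build_demonstrations(unlabeled_rows, proxy):
    """One demonstration per unlabeled row, with the proxy column's value
    standing in for the unavailable target label (hypothetical template)."""
    return [
        f"{serialize(row, proxy)}. What is {proxy}? {row[proxy]}"
        for row in unlabeled_rows
    ]


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    labeled = [
        {"job": "engineer", "city": "A", "owns_home": "yes"},
        {"job": "engineer", "city": "B", "owns_home": "yes"},
        {"job": "clerk", "city": "A", "owns_home": "no"},
        {"job": "clerk", "city": "B", "owns_home": "yes"},
    ]
    income = ["high", "high", "low", "low"]
    unlabeled = [
        {"job": "clerk", "city": "A", "owns_home": "no"},
        {"job": "engineer", "city": "B", "owns_home": "yes"},
    ]
    proxy = pick_proxy_column(labeled, income)
    print("\n".join(build_demonstrations(unlabeled, proxy)))
```

A purity score is only one possible choice. It overrates columns with many unique values, so numeric features would need binning or a different measure such as mutual information.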
Submission Number: 16