Keywords: Tabular Embedding Model, Tabular Embedding Benchmark, Tabular Retrieval
Abstract: Foundation models have established unified representations for natural language processing, yet this paradigm remains largely unexplored for tabular data.
Existing methods face fundamental limitations: LLM-based approaches lack retrieval-compatible vector outputs, whereas text embedding models often fail to capture tabular structure and numerical semantics.
To bridge this gap, we first introduce the Tabular Embedding Benchmark (TabBench), a comprehensive suite designed to evaluate the tabular understanding capability of embedding models.
We then propose TabEmbed, the first generalist embedding model that unifies tabular classification and retrieval within a shared embedding space.
By reformulating diverse tabular tasks as semantic matching problems, TabEmbed leverages large-scale contrastive learning with positive-aware hard negative mining to discern fine-grained structural and numerical nuances.
Experimental results on TabBench demonstrate that TabEmbed significantly outperforms state-of-the-art text embedding models, establishing a new baseline for universal tabular representation learning.
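As an illustrative aside (not part of the submission itself), the abstract's mention of contrastive learning with mined hard negatives could correspond to an InfoNCE-style objective such as the minimal sketch below. All names, shapes, and the temperature value are hypothetical assumptions; the "positive-aware" mining strategy itself is not specified in the abstract and is not shown here.

import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, pos_emb, hard_neg_emb, temperature=0.05):
    # query_emb:    (B, D)    embeddings of queries (e.g., serialized table rows)
    # pos_emb:      (B, D)    embeddings of the matching positives
    # hard_neg_emb: (B, K, D) embeddings of K mined hard negatives per query
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    n = F.normalize(hard_neg_emb, dim=-1)

    # Cosine similarity to the positive: (B, 1)
    pos_sim = (q * p).sum(dim=-1, keepdim=True)
    # Cosine similarity to each hard negative: (B, K)
    neg_sim = torch.einsum("bd,bkd->bk", q, n)

    # Logits over [positive, negatives]; the correct class is always index 0.
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)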
Paper Type: Long
Research Area: Information Extraction and Retrieval
Research Area Keywords: dense retrieval, document representation
Languages Studied: English
Submission Number: 7271