TableBench: A Capability-Based Table Benchmark for Large Language Models

ACL ARR 2024 June Submission 5551 Authors

16 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: The rapid advancement of techniques in large language models (LLMs) for processing tabular data necessitates improvements in evaluation benchmarks. However, most existing table benchmarks evaluate from a single task-based perspective and fail to provide a comprehensive, fine-grained assessment of LLMs' table-related capabilities. To address this gap, we introduce TableBench, a capability-based benchmark tailored to evaluate the performance of LLMs on tabular data. Our framework delineates 10 essential capabilities required from the moment a model receives a table-related input to the generation of its output, with each capability tested across 6 table formats. We evaluate 20 models using TableBench and observe that GPT-4 and GPT-4o achieve the highest scores, while phi3-small outperforms other open-source models of similar scale. Drawing from our evaluation, we present a series of valuable insights that can serve as a pivotal reference for future table-related LLM research.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmark, large language model, evaluation, tabular data
Contribution Types: Model analysis & interpretability, Reproduction study, Data resources, Data analysis
Languages Studied: English
Submission Number: 5551