Keywords: Hallucination, Large Language Model, Table Hallucination, Faithfulness, Benchmark
Abstract: While Large Language Models (LLMs) excel at processing unstructured text, their reliability falters in structured data generation, leading to a critical issue we term table hallucination. Existing benchmarks, reliant on monolithic accuracy scores, fail to diagnose the specific ways models err. To address this, we introduce a systematic framework for understanding and evaluating this problem. Our contributions are threefold. First, we provide a formal definition and a comprehensive taxonomy of table hallucinations. Second, based on this taxonomy, we construct TableHallu, the first diagnostic benchmark for this task. TableHallu is built using a novel, scalable pipeline that programmatically injects distractors to create challenging test cases. This automated process eliminates the need for costly manual annotation and achieves over 95% accuracy under human verification. Third, we conduct a comprehensive evaluation of state-of-the-art LLMs on TableHallu. The results reveal alarming and previously obscured vulnerabilities: models universally struggle with ordering constraints, frequently invent non-existent entities or attributes, and fail at elementary arithmetic during table generation. Our work provides the first systematic analysis of table hallucinations and a robust benchmark to steer future research away from pursuing simple accuracy and towards achieving verifiable, multi-faceted reliability. Code and data will be made available.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, NLP datasets, automatic creation and evaluation of language resources, evaluation methodologies, metrics
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 9623