Keywords: Hallucination, Large Language Model, Table Hallucination, Faithfulness, Benchmark
Abstract: While Large Language Models (LLMs) excel at processing unstructured text, their reliability falters in structured data generation, leading to a critical issue we term table hallucination. Existing benchmarks, reliant on monolithic accuracy scores, fail to diagnose the specific ways models err. To address this, we introduce a systematic framework for understanding and evaluating this problem. Our contributions are threefold. First, we provide a formal definition and a comprehensive taxonomy of table hallucinations. Second, based on this taxonomy, we construct TableHallu, the first diagnostic benchmark for this task. TableHallu is built using a novel, scalable pipeline that programmatically injects distractors to create challenging test cases. This automated process eliminates the need for costly manual annotation and achieves over 95% accuracy under human verification. Third, we conduct a comprehensive evaluation of state-of-the-art LLMs on TableHallu. The results reveal alarming and previously obscured vulnerabilities: models universally struggle with ordering constraints, frequently invent non-existent entities or attributes, and fail at elementary arithmetic during table generation. Our work provides the first systematic analysis of table hallucinations and a robust benchmark to steer future research away from pursuing simple accuracy and towards achieving verifiable, multi-faceted reliability. Code and data will be made available.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, NLP datasets, automatic creation and evaluation of language resources, evaluation methodologies, metrics
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 9623