TQA-Bench: Evaluating LLMs for Multi-Table Question Answering

16 Sept 2025 (modified: 28 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Table Question Answering, Large Language Model
Abstract: Advances in large language models (LLMs) have unlocked great opportunities in complex data management tasks, particularly in question answering (QA) over multi-table relational data. Despite significant progress, systematically evaluating LLMs on multi-table QA remains a critical challenge due to the inherent complexity of analyzing heterogeneous table structures and the potentially large scale of serialized tabular data. Existing benchmarks primarily focus on single-table QA and fail to capture the intricate connections across multiple relational tables that arise in real-world domains such as finance, healthcare, and e-commerce. To bridge this gap, we present TQA-Bench, a new multi-table QA benchmark designed to evaluate the capabilities of LLMs in tackling complex QA tasks over relational data. Our benchmark incorporates diverse relational database instances sourced from real-world public datasets and introduces a flexible sampling mechanism to create tasks with varying multi-table context lengths, ranging from 8K to 64K tokens. To further ensure robustness and reliability, we integrate symbolic extensions into the evaluation framework, enabling the assessment of LLM reasoning capabilities beyond simple data retrieval or probabilistic pattern matching. We systematically evaluate a range of LLMs, both closed-source and open-source, spanning model scales from 2 billion to 671 billion parameters. Our extensive experiments reveal critical insights into the performance of LLMs in multi-table QA, highlighting both challenges and opportunities for advancing their application in complex, data-driven environments.
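A minimal sketch (not the authors' released code) of how such a sampling mechanism might work: rows are drawn round-robin from a set of related tables until the serialized multi-table context reaches a target token budget. The table names, the round-robin strategy, and the character-based token estimator below are illustrative assumptions standing in for whatever the benchmark actually uses.

```python
import csv
import random
from io import StringIO


def estimate_tokens(text: str) -> int:
    # Rough stand-in for a real tokenizer: assume ~4 characters per token.
    return len(text) // 4


def serialize_table(name, header, rows):
    # Serialize one table as CSV preceded by a "Table: <name>" line.
    buf = StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    return f"Table: {name}\n{buf.getvalue()}"


def sample_context(tables, token_budget, seed=0):
    """tables: dict of name -> (header, rows). Returns a serialized
    multi-table context whose size is close to token_budget."""
    rng = random.Random(seed)
    pools = {name: list(rows) for name, (_, rows) in tables.items()}
    for pool in pools.values():
        rng.shuffle(pool)
    sampled = {name: [] for name in tables}

    def render():
        return "\n\n".join(
            serialize_table(n, tables[n][0], sampled[n]) for n in tables
        )

    # Round-robin: add one row per table at a time until the token budget
    # is reached or every table is exhausted.
    while any(pools.values()):
        for name, pool in pools.items():
            if not pool:
                continue
            sampled[name].append(pool.pop())
            if estimate_tokens(render()) >= token_budget:
                return render()
    return render()


# Toy example: two related tables sampled into an ~8K-token context.
rng = random.Random(1)
orders = (["order_id", "customer_id", "amount"],
          [[i, i % 50, round(rng.uniform(5, 500), 2)] for i in range(5000)])
customers = (["customer_id", "name"], [[i, f"cust_{i}"] for i in range(50)])
ctx = sample_context({"orders": orders, "customers": customers}, token_budget=8000)
print(estimate_tokens(ctx), "approx. tokens")
```

Varying token_budget (e.g., from 8K to 64K) would yield the different context-length variants of each task described in the abstract.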
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 7610