Exploring the Robustness of Language Models for Tabular Question Answering via Attention Analysis

Published: 24 Aug 2025, Last Modified: 24 Aug 2025. Accepted by TMLR. License: CC BY 4.0
Abstract: Large Language Models (LLMs), already shown to excel at a variety of unstructured text comprehension tasks, have also proven remarkably capable at table (structured) comprehension tasks without task-specific training. Building on earlier studies of LLMs for tabular tasks, we probe how in-context learning (ICL), model scale, instruction tuning, and domain bias affect Tabular QA (TQA) robustness by testing LLMs, under diverse augmentations and perturbations, across diverse domains: Wikipedia-based $\textbf{WTQ}$, financial $\textbf{TAT-QA}$, and scientific $\textbf{SCITAB}$. Although instruction tuning and larger, newer LLMs deliver stronger, more robust TQA performance, data contamination and reliability issues, especially on $\textbf{WTQ}$, remain unresolved. Through an in-depth attention analysis, we reveal a strong correlation between perturbation-induced shifts in attention dispersion and drops in performance, with sensitivity peaking in the model's middle layers. We highlight the need for improved interpretability methodologies to develop more reliable LLMs for table comprehension. Based on these findings, we argue for structure-aware self-attention mechanisms and domain-adaptive processing techniques to improve the transparency, generalization, and real-world reliability of LLMs on tabular data.
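The abstract's key quantity is "attention dispersion" under perturbation. A common way to operationalize dispersion (an illustrative assumption here, not necessarily the paper's exact metric) is the Shannon entropy of an attention row: low entropy means attention concentrates on a few table cells, high entropy means it spreads out. A minimal sketch:

```python
import math

def attention_dispersion(attn, eps=1e-12):
    """Shannon entropy of one attention distribution.

    attn: iterable of non-negative attention weights for a single
    query token over the input tokens (need not be pre-normalized).
    Higher values mean attention is spread more uniformly.
    """
    total = sum(attn)
    probs = [a / total for a in attn]  # normalize to a distribution
    return -sum(p * math.log(p + eps) for p in probs)

# Dispersion grows as attention spreads from focused to uniform.
focused = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]
print(attention_dispersion(focused))  # low entropy
print(attention_dispersion(uniform))  # maximal entropy, ln(4) ≈ 1.386
```

Comparing this per-layer entropy before and after a table perturbation (e.g., a value swap) gives a layer-wise dispersion-shift profile of the kind the abstract correlates with performance drops.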
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Since the original TMLR submission, we have made substantial revisions to address reviewer feedback. We expanded the dataset characterization to better justify the choice of WTQ, TAT-QA, and SCITAB, and provided an explicit description of table-size controls to isolate structural effects from long-context limitations. We clarified the role of our value perturbations (NT, NVP, DVP, RVP) as practical probes for memorization and fidelity, and added more detailed descriptions of perturbation generation. The results section was reorganized with clearer comparisons across model families, sizes, and tuning styles, and supplemented with standard deviations for statistical rigor (added as a large table in the appendix for clarity). We improved the clarity of the attention analysis and added new robustness results for TAPAS and TAPEX models, as well as for Llama2, Qwen2.5, and Qwen3 models. We also enhanced figure and table captions for interpretability and streamlined the discussion of limitations and future directions. More detailed results from new experiments are provided in the Appendix.
Code: https://github.com/KBhandari11/RobustTableQA
Assigned Action Editor: ~Peilin_Zhao2
Submission Number: 5174