What Really Matters for Table LLMs? A Meta-Evaluation of Model and Data Effects

ACL ARR 2025 May Submission1553 Authors

17 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: In this work, we revisit the trajectory of table LLM development and highlight emerging challenges in the LLM era, particularly the paradox of choice: the difficulty of attributing performance gains amid diverse base models and training sets. We replicate four table LLMs by instruction-tuning three foundation models on four existing datasets, yielding 12 models. We then evaluate these models across 16 table benchmarks. Our analysis reveals that while training data plays a role, base model selection is important and, in many cases, dominates performance. Generalization and reasoning remain challenging. Based on these findings, we share our thoughts on future directions for table modeling.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: table instruction tuning, table LLMs, generalization, replication, OOD evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study
Languages Studied: English
Submission Number: 1553