Abstract: Interpretability in Table Question Answering (Table QA) is critical, especially in high-stakes domains like finance and healthcare. While recent Table QA approaches based on Large Language Models (LLMs) achieve high accuracy, they often produce ambiguous explanations of how answers are derived. We propose Plan-of-SQLs (POS), a new Table QA method that makes the model's decision-making process interpretable. POS decomposes a question into a sequence of atomic steps, each directly translated into an executable SQL command on the table, thereby ensuring that every intermediate result is transparent. Through extensive experiments, we show that: First, POS generates the highest-quality explanations among compared methods, which markedly improves users' ability to simulate and verify the model's decisions. Second, when evaluated on standard Table QA benchmarks (TabFact, WikiTQ, and FeTaQA), POS achieves QA accuracy that is competitive with existing methods, while also offering greater efficiency (up to 25x fewer LLM calls and table database queries) and more robust performance on large tables. Finally, we observe high agreement (up to 90.59% in forward simulation) between LLMs and human users when making decisions based on the same explanations, suggesting that LLMs could serve as an effective proxy for humans in evaluating Table QA explanations.
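The decomposition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the plan below is hard-coded rather than generated by an LLM, and the names `run_plan`, `ROWS`, `COLUMNS`, and `PLAN`, as well as the toy table, are hypothetical. It only shows the general idea of running a sequence of atomic SQL steps, each over the previous step's result, while keeping every intermediate table available for inspection.

```python
import sqlite3


def run_plan(rows, columns, steps):
    """Run atomic SQL steps in order and return every intermediate result.

    Each step is a (description, sql) pair; `{table}` in the SQL is replaced
    with the name of the table produced by the previous step.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE t0 ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f"INSERT INTO t0 VALUES ({placeholders})", rows)

    trace = []
    for i, (description, sql) in enumerate(steps):
        src, dst = f"t{i}", f"t{i + 1}"
        # Materialize the step's output so the next step can query it.
        conn.execute(f"CREATE TABLE {dst} AS {sql.format(table=src)}")
        result = conn.execute(f"SELECT * FROM {dst}").fetchall()
        trace.append((description, result))
    conn.close()
    return trace


# Hypothetical toy table and hand-written plan (assumptions for illustration).
ROWS = [("Alice", "US", 30), ("Bob", "UK", 25), ("Carol", "US", 41)]
COLUMNS = ["name", "country", "score"]
PLAN = [
    ("Keep rows where country is US",
     "SELECT * FROM {table} WHERE country = 'US'"),
    ("Count the remaining rows",
     "SELECT COUNT(*) AS n FROM {table}"),
]

TRACE = run_plan(ROWS, COLUMNS, PLAN)
for description, result in TRACE:
    print(description, "->", result)
```

Because each step writes its own table, a reader can check the output of any single step against its natural-language description, which is the kind of step-level transparency the abstract refers to.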
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
- We have added a paragraph on PAL in Appendix B of the revised manuscript (added text is in teal color).
- We have added an error analysis to Appendix J.2 of our revised manuscript.
- We have added a discussion of atomic-step coverage: “Can every table-based question be decomposed into a set of atomic steps?” in Appendix C.
- We have added a discussion of samples that POS cannot process: “What tabular questions are not SQL-decomposable?” in Appendix C.
- We have added a new ablation study for POS accuracy in Appendix C.
- We have added the discussion “SQL or Python; which is better for Table QA?” in Appendix C to explain why we chose SQL rather than Python.
- We have expanded our discussion of one-step planning in Appendix J.3 to make it clearer why one-step planning is more effective.
- We have revised Section 5.
Assigned Action Editor: ~Binhang_Yuan1
Submission Number: 4670