Abstract: Since many real-world documents combine textual and tabular data, robust Retrieval Augmented Generation (RAG) systems are essential for effectively accessing and analyzing such content to support complex reasoning tasks.
Therefore, this paper introduces $\textbf{T$^2$-RAGBench}$, a benchmark comprising $\textbf{23,088}$ question-context-answer triples, designed to evaluate RAG methods on real-world text-and-table data.
Unlike typical QA datasets that operate under Oracle-Context settings, $\textbf{T$^2$-RAGBench}$ challenges models to first retrieve the correct context before conducting numerical reasoning. Existing text-and-table QA datasets typically contain context-dependent questions, which may yield multiple correct answers depending on the provided context.
To address this, we transform SOTA datasets into a context-independent format; experts validated that 91.3% of the resulting questions are context-independent, enabling reliable RAG evaluation.
Our comprehensive evaluation identifies $\textit{Hybrid BM25}$, a technique that combines dense and sparse vectors, as the most effective approach for text-and-table data. However, results demonstrate that $\textbf{T$^2$-RAGBench}$ remains challenging even for SOTA LLMs and RAG methods. Further ablation studies examine the impact of embedding models and corpus size on retrieval performance.
$\textbf{T$^2$-RAGBench}$ provides a realistic and rigorous benchmark for existing RAG methods on text-and-table data. Code and dataset are available online: https://anonymous.4open.science/r/g4kmu-paper-D5F8/README.md.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: retrieval-augmented generation, corpus creation, benchmarking, evaluation, table QA, logical reasoning, financial/business NLP, question generation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Previous URL: https://openreview.net/forum?id=qs21As1YTW
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: No, I want the same area chair from our previous submission (subject to their availability).
Reassignment Request Reviewers: Yes, I want a different set of reviewers
Justification For Not Keeping Action Editor Or Reviewers: All but one of the reviewers gave low confidence scores, and one reviewer had difficulty understanding the overall concept of the paper. Additionally, none of the reviewers were particularly interested in the retrieval aspect; instead, they all focused on the performance of the generator, which is less critical from our perspective. We would like to have a more in-depth discussion about RAG evaluation, rather than the general capabilities of LLMs.
Data: zip
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: No
A2 Elaboration: Paper is about RAG on text-and-table documents, which contain no personalized or offensive content.
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: 2
B2 Discuss The License For Artifacts: No
B2 Elaboration: All used datasets are open-sourced and free to use.
B3 Artifact Use Consistent With Intended Use: No
B3 Elaboration: The datasets were modified, but to solve the same numerical task.
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: No personal data in the dataset.
B5 Documentation Of Artifacts: No
B5 Elaboration: The dataset only contains financial documents; no further domains, languages, or linguistic phenomena are relevant to document.
B6 Statistics For Data: Yes
B6 Elaboration: 4
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: 6
C2 Experimental Setup And Hyperparameters: No
C2 Elaboration: We do not train models; therefore, all hyperparameters for RAG and the experiments are provided in the Evaluation section.
C3 Descriptive Statistics: Yes
C3 Elaboration: 6
C4 Parameters For Packages: Yes
C4 Elaboration: 5
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: Yes
D1 Elaboration: Appendix
D2 Recruitment And Payment: Yes
D2 Elaboration: 4.2
D3 Data Consent: No
D3 Elaboration: No research with human subjects was conducted.
D4 Ethics Review Board Approval: N/A
D4 Elaboration: No research with human subjects was conducted.
D5 Characteristics Of Annotators: Yes
D5 Elaboration: Appendix
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: According to the latest ACL Policy on AI usage, reporting the use of coding or writing-assistance tools is not required. We primarily used GitHub Copilot and Grammarly, plus ChatGPT for LaTeX table and plot styling.
Author Submission Checklist: yes
Submission Number: 1247