Abstract: Since many real-world documents combine textual and tabular data, robust Retrieval Augmented Generation (RAG) systems are essential for effectively accessing and analyzing such content to support complex reasoning tasks.
Therefore, this paper introduces $\textbf{T$^2$-RAGBench}$, a benchmark comprising $\textbf{23,088}$ question-context-answer triples, designed to evaluate RAG methods on real-world text-and-table data.
Unlike typical QA datasets that operate under Oracle-Context settings, $\textbf{T$^2$-RAGBench}$ challenges models to first retrieve the correct context before conducting numerical reasoning. Existing text-and-table QA datasets typically contain context-dependent questions, which may yield multiple correct answers depending on the provided context.
To address this, we transform SOTA datasets into a context-independent format; experts validated that 91.3% of the resulting questions are context-independent, enabling reliable RAG evaluation.
Our comprehensive evaluation identifies $\textit{Hybrid BM25}$, a technique that combines dense and sparse vectors, as the most effective approach for text-and-table data. However, results demonstrate that $\textbf{T$^2$-RAGBench}$ remains challenging even for SOTA LLMs and RAG methods. Further ablation studies examine the impact of embedding models and corpus size on retrieval performance.
$\textbf{T$^2$-RAGBench}$ provides a realistic and rigorous benchmark for existing RAG methods on text-and-table data. Code and dataset are available online: https://anonymous.4open.science/r/g4kmu-paper-D5F8/README.md.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: retrieval-augmented generation, corpus creation, benchmarking, evaluation, table QA, logical reasoning, financial/business NLP, question generation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Previous URL: https://openreview.net/forum?id=qs21As1YTW
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: No, I want the same area chair from our previous submission (subject to their availability).
Reassignment Request Reviewers: Yes, I want a different set of reviewers
Justification For Not Keeping Action Editor Or Reviewers: All but one of the reviewers gave low confidence scores, and one reviewer had difficulty understanding the overall concept of the paper. Additionally, none of the reviewers were particularly interested in the retrieval aspect; instead, they all focused on the performance of the generator, which is less critical from our perspective. We would like to have a more in-depth discussion about RAG evaluation, rather than the general capabilities of LLMs.
Data: zip
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: No
A2 Elaboration: Paper is about RAG on text-and-table documents, which contain no personalized or offensive content.
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: 2
B2 Discuss The License For Artifacts: No
B2 Elaboration: All used datasets are open-sourced and free to use.
B3 Artifact Use Consistent With Intended Use: No
B3 Elaboration: The datasets were modified, but to solve the same numerical task.
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: No personal data in the dataset.
B5 Documentation Of Artifacts: No
B5 Elaboration: The dataset only contains financial documents; no further domains, languages, or linguistic phenomena are relevant to document.
B6 Statistics For Data: Yes
B6 Elaboration: 4
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: 6
C2 Experimental Setup And Hyperparameters: No
C2 Elaboration: We do not train models; therefore, all hyperparameters for RAG and the experiments are provided in the Evaluation section.
C3 Descriptive Statistics: Yes
C3 Elaboration: 6
C4 Parameters For Packages: Yes
C4 Elaboration: 5
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: Yes
D1 Elaboration: Appendix
D2 Recruitment And Payment: Yes
D2 Elaboration: 4.2
D3 Data Consent: No
D3 Elaboration: No research with human subjects was conducted.
D4 Ethics Review Board Approval: N/A
D4 Elaboration: No research with human subjects was conducted.
D5 Characteristics Of Annotators: Yes
D5 Elaboration: Appendix
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: According to the latest ACL Policy on AI usage, reporting the use of coding or writing-assistance tools is not required. We primarily used GitHub Copilot and Grammarly, plus ChatGPT for LaTeX table and plot styling.
Author Submission Checklist: yes
Submission Number: 1247