T$^2$-RAGBench: Text-and-Table Aware Retrieval-Augmented Generation

ACL ARR 2025 May Submission803 Authors

15 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0

Abstract: Because most financial documents combine textual and tabular information, robust Retrieval-Augmented Generation (RAG) systems are essential for effectively accessing and reasoning over such content to perform complex numerical tasks. This paper introduces \textbf{T$^2$-RAGBench}, a benchmark comprising \textbf{32,908} question-context-answer triples, designed to evaluate RAG methods on real-world financial data. Unlike typical QA datasets that operate under oracle-context settings, where the relevant context is explicitly provided, T$^2$-RAGBench challenges models to first retrieve the correct context before conducting numerical reasoning. Existing QA datasets involving text and tables typically contain context-dependent questions, which may yield multiple correct answers depending on the provided context. To address this, we transform these datasets into a context-independent format, enabling reliable RAG evaluation. We conduct a comprehensive evaluation of popular RAG methods. Our analysis identifies \textit{Hybrid BM25}, a technique that combines dense and sparse vectors, as the most effective approach for text-and-table data. However, our results show that T$^2$-RAGBench remains challenging even for SOTA LLMs and RAG methods. Further ablation studies examine the impact of embedding models and corpus size on retrieval performance. T$^2$-RAGBench provides a realistic and rigorous benchmark for existing RAG methods on text-and-table data. Code and dataset are available online: \url{https://anonymous.4open.science/r/g4kmu-paper-D5F8/README.md}.
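
The abstract describes Hybrid BM25 only as a combination of dense and sparse vectors and does not say how the two signals are fused. As a rough illustration of the general idea, the sketch below scores a toy corpus with Okapi BM25 (sparse) and cosine similarity over embeddings (dense), then merges the two rankings with reciprocal rank fusion. The fusion rule, the constant `k=60`, the whitespace tokenizer, the toy documents, and the hand-written three-dimensional "embeddings" are all assumptions made for this example, not details taken from the paper.

```python
import math
from collections import Counter

def tokenize(text):
    # Whitespace tokenizer; a stand-in for a real tokenizer (assumption).
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each document for the query (sparse signal)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def dense_scores(query_vec, doc_vecs):
    """Cosine similarity between the query and each document embedding."""
    def cosine(u, v):
        dot = sum(a * c for a, c in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(c * c for c in v))
        return dot / (nu * nv) if nu and nv else 0.0
    return [cosine(query_vec, v) for v in doc_vecs]

def to_ranks(scores):
    """Map document index to its 1-based rank (higher score ranks first)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return {doc: rank for rank, doc in enumerate(order, start=1)}

def hybrid_rrf(sparse, dense, k=60):
    """Reciprocal rank fusion of a sparse and a dense score list (assumed rule)."""
    rs, rd = to_ranks(sparse), to_ranks(dense)
    fused = [1 / (k + rs[i]) + 1 / (k + rd[i]) for i in range(len(sparse))]
    return sorted(range(len(fused)), key=lambda i: -fused[i])

# Toy corpus mixing prose and flattened table content (hypothetical data).
docs = [
    "net revenue 2019 table: revenue 120 million, costs 80 million",
    "the company opened three new offices in 2019",
    "operating costs rose due to higher wages",
]
query = "revenue 2019"

# Hand-written vectors standing in for a real embedding model (assumption).
query_vec = [1.0, 0.2, 0.0]
doc_vecs = [[0.9, 0.1, 0.0], [0.5, 0.8, 0.1], [0.0, 0.2, 0.9]]

sparse = bm25_scores(query, docs)
dense = dense_scores(query_vec, doc_vecs)
ranking = hybrid_rrf(sparse, dense)
print(ranking)
```

Rank-based fusion is used here because BM25 scores and cosine similarities live on different scales; a weighted sum of normalized scores would be another reasonable choice.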

Paper Type: Long

Research Area: Resources and Evaluation

Research Area Keywords: retrieval-augmented generation, corpus creation, benchmarking, evaluation, table QA, logical reasoning, financial/business NLP, question generation

Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis

Languages Studied: English

Submission Number: 803
