Beyond Extraction: Contextualising Tabular Data for Efficient Summarisation by Language Models

Authors: ACL ARR 2024 June Submission345 Authors

10 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · License: CC BY 4.0
Abstract: The conventional Retrieval-Augmented Generation (RAG) architecture has proven effective for retrieving information from diverse documents. However, it struggles with complex table queries, especially in PDF documents containing intricate tabular structures. Our work introduces an approach to improve the accuracy of complex table queries in RAG-based systems. Our methodology stores the PDFs in the retrieval database and extracts their tabular content separately. The extracted tables undergo context enrichment: each header is concatenated with its corresponding values, and the resulting tabular data is further enriched with contextual understanding using GPT-3.5-turbo via a one-shot prompt. This enriched data is then added to the retrieval database alongside the original PDFs. Our approach aims to substantially improve the accuracy of complex table queries, addressing a longstanding challenge in information retrieval.
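The enrichment step described in the abstract can be illustrated with a minimal sketch. Everything below is hypothetical: the helper names (linearise_table, enrich_table), the one-shot prompt wording, and the use of the openai Python client are assumptions for illustration, not the authors' released code.

    # Hypothetical sketch of the header-value concatenation and one-shot
    # GPT-3.5-turbo enrichment step; not the authors' implementation.
    from openai import OpenAI  # assumes the openai Python client (>=1.0) is installed

    client = OpenAI()

    def linearise_table(headers, rows):
        """Concatenate each header with its corresponding cell value."""
        lines = []
        for row in rows:
            pairs = [f"{h}: {v}" for h, v in zip(headers, row)]
            lines.append("; ".join(pairs))
        return "\n".join(lines)

    # Illustrative one-shot prompt: a single worked example followed by the new table.
    ONE_SHOT_PROMPT = (
        "Rewrite the following table rows as short, self-contained sentences.\n"
        "Example:\n"
        "Input: Country: France; Capital: Paris; Population (M): 67\n"
        "Output: France has Paris as its capital and a population of about 67 million.\n\n"
        "Input:\n{table}\nOutput:"
    )

    def enrich_table(headers, rows, model="gpt-3.5-turbo"):
        """Ask the LLM to add contextual understanding to the linearised table."""
        table_text = linearise_table(headers, rows)
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": ONE_SHOT_PROMPT.format(table=table_text)}],
        )
        return response.choices[0].message.content

    # The enriched text would then be chunked, embedded, and added to the
    # retrieval database alongside the original PDFs.

In such a pipeline, the enriched sentences are indexed like any other document chunk, so downstream table queries can be answered by the same retriever without special-casing tabular content.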
Paper Type: Short
Research Area: Summarization
Research Area Keywords: Summarization, Information Extraction, Information Retrieval
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Surveys
Languages Studied: English
Submission Number: 345