Unlocking the Power of LLMs for Efficiently Automatic Extract Information from Hybrid Long Documents
Abstract: Information extraction is a vital task in natural language processing. It involves extracting information of interest to users from natural language text and serves many downstream tasks, including knowledge graph construction, information retrieval, and question answering. Given the robust comprehension and reasoning that LLMs show across diverse tasks, their potential for information extraction is substantial. However, applying LLMs directly to complex documents faces several challenges: handling lengthy documents, understanding tables, adapting to ambiguous representations, and ensuring numerical precision. Because no comprehensive dataset covers these challenges, we introduce the Financial Reports Numerical Extraction (FINE) dataset to facilitate further investigation. We present the Split-Recombination Framework (SiReF), which addresses these challenges with table serialization, embedding retrieval, and precision prompts. Extensive experimental results demonstrate its adaptability across various domains and across LLMs with different capabilities. The dataset and code are provided in the attachments.
Paper Type: long
Research Area: Information Extraction
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Consent To Share Submission Details: On behalf of all authors, we agree to the terms above to share our submission details.