Information Extraction from PDF Tables with Large Language Models

Anonymous

16 Feb 2024
ACL ARR 2024 February Blind Submission
Readers: Everyone
Abstract: Tables in PDF documents contain valuable quantitative information. Unfortunately, extracting this information is difficult because both the structure and the content of tables vary widely. We propose statements, a novel data structure that holds quantitative facts and their related information in a self-contained form. We propose translating tables to statements as a new supervised deep-learning information extraction task. We introduce SemTabNet, a dataset of over 100K annotated tables. Among the family of T5-based Statement Extraction Models we investigate, the best model predicts statements that are 82% similar to the ground truth (F1 score of 0.97 for extracting entities). We demonstrate the advantages of representing information as statements by applying our model to over 2700 tables from ESG reports. Because statements are homogeneous, they permit data-science analysis of the extensive information found in large collections of tables.
Paper Type: long
Research Area: Information Extraction
Contribution Types: Model analysis & interpretability, Data resources, Data analysis, Position papers
Languages Studied: English
Preprint Status: There is no non-anonymous preprint and we do not intend to release one.
A1: yes
A1 Elaboration For Yes Or No: Section 8
A2: n/a
A2 Elaboration For Yes Or No: NA
A3: yes
A3 Elaboration For Yes Or No: Abstract and Section 1.
B: yes
B1: yes
B1 Elaboration For Yes Or No: We used the deepsearch toolkit for gathering documents and mention it appropriately.
B2: n/a
B2 Elaboration For Yes Or No: NA
B3: n/a
B3 Elaboration For Yes Or No: NA
B4: n/a
B4 Elaboration For Yes Or No: NA
B5: n/a
B6: yes
B6 Elaboration For Yes Or No: Section 4
C: yes
C1: yes
C1 Elaboration For Yes Or No: Section 5
C2: yes
C2 Elaboration For Yes Or No: Section 5
C3: yes
C3 Elaboration For Yes Or No: Section 5
C4: yes
C4 Elaboration For Yes Or No: Mentioned where applicable.
D: yes
D1: yes
D1 Elaboration For Yes Or No: Section 4 explains our annotations.
D2: n/a
D2 Elaboration For Yes Or No: NA
D3: n/a
D3 Elaboration For Yes Or No: NA
D4: n/a
D4 Elaboration For Yes Or No: NA
D5: n/a
D5 Elaboration For Yes Or No: NA
E: no
E1: n/a
E1 Elaboration For Yes Or No: NA