Layout-Aware Neural Model for Resolving Hierarchical Table Structure

Anonymous

16 Nov 2021 (modified: 05 May 2023)
ACL ARR 2021 November Blind Submission
Readers: Everyone
Abstract: Many pipelines for extracting information from tables assume a simple table structure, but tables in the financial domain frequently have complex, hierarchical structure, most notably parent-child relationships between header cells. Most prior datasets of tables annotated from images or PDF files, and most models for extracting table structure, concentrate on bounding-box extraction for tables, cells, rows, and columns; fine-grained table structure remains relatively unexplored. In this study, we present a dataset of 887 tables, manually labeled for cell types and column hierarchy relations. The tables are selected from IBM FinTabNet, a much larger dataset of more than 100,000 financial tables whose cell, row, and column bounding boxes were extracted by deep learning but which lacks the semantic cell-type and cell-to-cell relation labels that we add. The 887 tables are selected using heuristics so that roughly half of them have complex hierarchical structure, a much larger proportion than in a random sample from FinTabNet. We then fine-tune LayoutLM-based models on two tasks, cell-type classification and identification of hierarchical relations among column headers, achieving F1 scores of 95% and 70%, respectively. Finally, we use the trained model to create soft labels for the entirety of FinTabNet.
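
To make the two annotation layers concrete, the following is a minimal Python sketch of how cell-type labels and parent-child header relations could be represented and queried. It is an illustration under stated assumptions, not the dataset's actual schema: the class names, the `cell_type` label strings, and the `header_path` helper are all hypothetical, and the example header texts are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Cell:
    cell_id: int
    text: str
    cell_type: str  # hypothetical label, e.g. "column_header" or "data"


@dataclass
class Table:
    # Hypothetical container: cells by id, plus child -> parent header links.
    cells: Dict[int, Cell] = field(default_factory=dict)
    parent_of: Dict[int, int] = field(default_factory=dict)

    def add_cell(self, cell: Cell) -> None:
        self.cells[cell.cell_id] = cell

    def add_relation(self, parent_id: int, child_id: int) -> None:
        # Record that the header `parent_id` spans the header `child_id`.
        self.parent_of[child_id] = parent_id

    def header_path(self, cell_id: int) -> List[str]:
        # Walk child -> parent links and return header texts from the top down.
        path: List[str] = []
        seen = set()
        current = cell_id
        while current is not None and current not in seen:
            seen.add(current)  # guards against accidental cycles in the labels
            path.append(self.cells[current].text)
            current = self.parent_of.get(current)
        return list(reversed(path))


# Invented example: one spanning header with two child column headers.
table = Table()
table.add_cell(Cell(0, "Year ended December 31", "column_header"))
table.add_cell(Cell(1, "2020", "column_header"))
table.add_cell(Cell(2, "2019", "column_header"))
table.add_relation(parent_id=0, child_id=1)
table.add_relation(parent_id=0, child_id=2)

path_2020 = table.header_path(1)
```

Here `path_2020` resolves to the full header chain for the "2020" column, which is the kind of information a flat row-and-column bounding-box representation does not carry.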