Abstract: While many pipelines for extracting information from tables assume a simple table structure, tables in the financial domain frequently have a complex, hierarchical structure. The primary example is parent-child relationships between header cells. Most prior datasets of tables annotated from images or PDFs, and most models for extracting table structure, concentrate on detecting table boundaries and extracting bounding boxes for cells, rows, and columns. Fine-grained table structure remains relatively unexplored. This study presents a dataset of 657 tables, manually labeled for cell types and column hierarchy relations. The tables are selected from IBM FinTabNet using heuristics, so that roughly half of the selected tables have a complex hierarchical structure, a much larger proportion than in a random sample from FinTabNet. Further, we fine-tune LayoutLM-based models to classify cell types and to identify hierarchical relations among column headers. We achieve F1 scores of 97% and 73% on the respective tasks. Finally, we use the trained model to create soft labels for the entirety of FinTabNet.
Paper Type: short