Towards a universal dataset and metrics for training and evaluating table extraction models

Brandon Smock; Rohith Pesala; Robin Abraham

08 Jun 2021 (modified: 24 May 2023). Submitted to NeurIPS 2021 Datasets and Benchmarks Track (Round 1). Readers: Everyone

Keywords: table detection, table structure recognition, table extraction, functional analysis, object detection

TL;DR: This paper describes PubTables1M, the largest dataset of its kind, and a new metric, grid table similarity (GriTS), for training and evaluating models for table extraction.

Abstract: Recently, interest has grown in applying machine learning approaches to the problem of table structure inference and extraction from unstructured documents. However, progress in this area has been challenging not only to make but also to measure, due to several issues that arise in both training and evaluating such systems from labeled data. These include challenges as fundamental as the lack of a single definitive ground truth output for a given input sample and the lack of an ideal metric for measuring partial correctness for this task. To address these issues, we propose a new dataset, PubMed Tables One Million (PubTables1M), and a new class of metric, grid table similarity (GriTS). PubTables1M is nearly twice as large as the current largest comparable dataset, can be used for models across multiple architectures and modalities, and addresses issues such as ambiguity and lack of consistency in the annotations. We apply DETR to table extraction for the first time and show that object detection models trained on images and bounding boxes derived from this data produce excellent results out-of-the-box for all three tasks of detection, structure recognition, and functional analysis. In addition to releasing the data, we describe the dataset creation process in detail to enable others to build on our work and to ensure forward and backward compatibility of this data for combining it with other datasets created for these tasks. It is our hope that this data and the proposed metrics can further progress in this area by serving as a single source of data for training and evaluation of a wide variety of models for table extraction.

Supplementary Material: zip

URL: https://pubtables1m.blob.core.windows.net/pubtables1m/README
