Selective Labeling: How to Radically Lower Data-Labeling Costs for Document Extraction Models

Yichao Zhou; James Bradley Wendt; Navneet Potti; Jing Xie; Sandeep Tata

Published: 07 Oct 2023, Last Modified: 01 Dec 2023, Venue: EMNLP 2023 Main

Submission Type: Regular Long Paper

Submission Track: Efficient Methods for NLP

Submission Track 2: Information Extraction

Keywords: information extraction, selective labeling, data efficiency, annotation efficiency

TL;DR: We propose Selective Labeling as a solution to reduce the cost of acquiring labeled data by 10× while achieving negligible loss in document extraction performance.

Abstract: Building automatic extraction models for visually rich documents like invoices, receipts, bills, tax forms, etc. has received significant attention lately. A key bottleneck in developing extraction models for new document types is the cost of acquiring the several thousand high-quality labeled documents that are needed to train a model with acceptable accuracy. In this paper, we propose selective labeling as a solution to this problem. The key insight is to simplify the labeling task to provide “yes/no” labels for candidate extractions predicted by a model trained on partially labeled documents. We combine this with a custom active learning strategy to find the predictions that the model is most uncertain about. We show through experiments on document types drawn from 3 different domains that selective labeling can reduce the cost of acquiring labeled data by 10× with a negligible loss in accuracy.
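The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the custom active learning strategy, so the code below substitutes plain uncertainty sampling: it ranks candidate extractions by how close the model's score is to 0.5 and sends the top few for "yes/no" review. The `Candidate` class, the `uncertainty` and `select_for_review` functions, the field names, and the scores are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate extraction predicted by a partially trained model (hypothetical schema)."""
    doc_id: str
    field: str
    text: str
    score: float  # model's estimated probability that the extraction is correct


def uncertainty(score: float) -> float:
    """Return 1.0 when score is 0.5 (most uncertain) and 0.0 when score is 0 or 1."""
    return 1.0 - 2.0 * abs(score - 0.5)


def select_for_review(candidates, budget):
    """Pick the `budget` candidates the model is least sure about.

    This is generic uncertainty sampling, an assumption standing in for the
    paper's custom strategy, which the abstract does not detail.
    """
    ranked = sorted(candidates, key=lambda c: uncertainty(c.score), reverse=True)
    return ranked[:budget]


# Illustrative data only; not taken from the paper.
candidates = [
    Candidate("doc-1", "total_amount", "$41.20", 0.97),
    Candidate("doc-1", "due_date", "2023-01-05", 0.52),
    Candidate("doc-2", "total_amount", "$7.00", 0.10),
    Candidate("doc-2", "due_date", "12/01", 0.45),
]

picked = select_for_review(candidates, budget=2)
for c in picked:
    # An annotator would answer "yes" or "no" for each of these.
    print(f"{c.doc_id} {c.field}: is '{c.text}' correct?")
```

In this sketch, confident predictions (scores near 0 or 1) are never shown to an annotator, which is where the labeling savings would come from; only the borderline candidates consume review budget.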

Submission Number: 2175
