AURORA: An Information Extraction System of Domain-specific Business Documents with Limited Data

Linh Le

Published: 01 Oct 2020, Last Modified: 28 Sept 2024 | CIKM 2020 | Everyone | CC BY 4.0

Abstract: Information extraction is a well-studied task that plays a critical role in many NLP applications, because its outputs can serve as an entry point for digital transformation. However, gaps remain when applying research results to actual business cases. This paper introduces AURORA, an information extraction system for domain-specific business documents. The core idea of AURORA is to use transfer learning for extraction: it uses transformers to cope with the limited training data available in business cases and stacks additional layers on top for domain adaptation. We demonstrate AURORA in actual scenarios in which users are invited to try two functions: fine-grained extraction and whole-paragraph extraction from Japanese business documents. A video of the system is available at http://y2u.be/xHQpYE41Tqw.