On Multimodal Few-shot Learning for Visually-rich Document Entity Retrieval

Anonymous

16 Dec 2022 (modified: 05 May 2023) · ACL ARR 2022 December Blind Submission · Readers: Everyone
Abstract: Visually-rich document entity retrieval (VDER), which extracts key information (e.g., dates, addresses, names) from document images (e.g., invoices, receipts), has become an increasingly important topic for NLP in industrial settings. Because many of these document images come from document types that are highly specific to their industry, annotating them usually requires extensive training and is often costly. New document types appear at a constant pace, each with its own unique set of entity types, leaving us with a challenging setting: a large volume of documents containing unseen entity types that occur only a few times. Such a setting requires models to learn entities in a few-shot manner, whereas recent work in the field handles few-shot learning only at the document level. We propose an $N$-way $K$-shot setting for VDER that operates at the \textit{entity level}, along with a new dataset for this problem. We formulate the problem as meta-learning and propose several new algorithms that help the model distinguish between in-task-distribution (ITD) entities while remaining aware of out-of-task-distribution (OTD) ones. To the best of our knowledge, our work is the first systematic study of the $N$-way $K$-shot entity-level setting for VDER.
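To make the $N$-way $K$-shot entity-level formulation concrete, the sketch below shows one common way to construct a meta-learning episode: sample $N$ entity types, then split their mentions into a $K$-shot support set and a query set. All names and the data layout here are illustrative assumptions, not the paper's actual dataset format or sampling procedure.

```python
import random
from collections import defaultdict

def sample_episode(mentions, n_way=5, k_shot=2, q_query=2, seed=None):
    """Sample one N-way K-shot episode at the entity level.

    `mentions` is a hypothetical list of (doc_id, entity_type) pairs;
    a real VDER dataset would carry images, boxes, and text as well.
    """
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for doc_id, entity_type in mentions:
        by_type[entity_type].append(doc_id)
    # Keep only entity types with enough mentions for support + query sets.
    eligible = [t for t, docs in by_type.items() if len(docs) >= k_shot + q_query]
    chosen = rng.sample(eligible, n_way)
    support, query = [], []
    for entity_type in chosen:
        docs = rng.sample(by_type[entity_type], k_shot + q_query)
        support += [(d, entity_type) for d in docs[:k_shot]]
        query += [(d, entity_type) for d in docs[k_shot:]]
    return support, query
```

Under this scheme, a model is meta-trained across many such episodes, so that at test time it can retrieve mentions of $N$ previously unseen entity types from only $K$ labeled examples each.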
Paper Type: long
Research Area: Information Retrieval and Text Mining