PPTSER: A Plug-and-Play Tag-guided Method for Few-shot Semantic Entity Recognition on Visually-rich Documents

15 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Few-shot Learning, Semantic Entity Recognition, Multi-modal Pre-trained Models, Prompt Learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: A simple yet effective Plug-and-Play Tag-guided method for few-shot Semantic Entity Recognition on visually-rich documents
Abstract: Visually-rich document information extraction (VIE) is a vital aspect of document understanding, wherein Semantic Entity Recognition (SER) plays a significant role. However, few-shot SER on visually-rich documents remains largely unexplored despite its considerable potential for practical applications. To address this issue, we propose a simple yet effective Plug-and-Play Tag-guided method for few-shot Semantic Entity Recognition (PPTSER) on visually-rich documents. PPTSER is a pluggable method built upon off-the-shelf multi-modal pre-trained models. It leverages the semantics of the tags to guide the SER task. In essence, PPTSER reformulates SER into entity typing and span detection, handling both tasks simultaneously via cross-attention. Experimental results illustrate that PPTSER outperforms the fine-tuning baseline and existing few-shot methods, especially in low-data regimes. With full training data, PPTSER achieves comparable or superior performance to the fine-tuning baseline. Specifically, on the FUNSD benchmark, our method improves the performance of LayoutLMv3 in the 1-shot, 3-shot and 5-shot scenarios by 15.61%, 2.13%, and 2.01%, respectively. On the XFUND-zh benchmark, it improves the performance of LayoutLMv3 by 3.73%, 6.16%, and 4.01%, respectively. Overall, PPTSER demonstrates promising generalizability, effectiveness, and a plug-and-play nature for few-shot SER on visually-rich documents. The code will be made available.
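
To make the tag-guided idea in the abstract concrete, the following minimal PyTorch sketch shows how document token features could attend over embeddings of the tag names, with the resulting attention weights serving as per-token entity-type scores. The class name, dimensions, single-head formulation, and example tag set are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class TagGuidedCrossAttention(nn.Module):
    """Sketch: score each document token against tag-name embeddings."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.query = nn.Linear(hidden_size, hidden_size)  # projects token features
        self.key = nn.Linear(hidden_size, hidden_size)    # projects tag-name features

    def forward(self, token_feats: torch.Tensor, tag_feats: torch.Tensor) -> torch.Tensor:
        # token_feats: (batch, seq_len, hidden) from a multi-modal encoder, e.g. LayoutLMv3
        # tag_feats:   (num_tags, hidden) embeddings of tag names such as "question"
        q = self.query(token_feats)                   # (batch, seq_len, hidden)
        k = self.key(tag_feats)                       # (num_tags, hidden)
        scores = q @ k.t() / (q.size(-1) ** 0.5)      # (batch, seq_len, num_tags)
        # Attention weights over tags double as per-token entity-type probabilities.
        return scores.softmax(dim=-1)

if __name__ == "__main__":
    # Hypothetical shapes: 4 tags, e.g. header / question / answer / other.
    batch, seq_len, hidden, num_tags = 2, 128, 768, 4
    token_feats = torch.randn(batch, seq_len, hidden)  # stand-in for encoder outputs
    tag_feats = torch.randn(num_tags, hidden)          # stand-in for tag-name embeddings
    probs = TagGuidedCrossAttention(hidden)(token_feats, tag_feats)
    pred_tags = probs.argmax(dim=-1)                   # (batch, seq_len)
```

Because the classifier is driven by tag-name embeddings rather than a fixed output layer, the same module can, in principle, be attached to different pre-trained backbones and label sets, which is what the plug-and-play framing suggests.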
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 119