An Empirical Study of Information Extraction from Vietnamese Documents

Lam Nguyen, Tien-Dong Nguyen, Minh-Tuan Dang, Dinh-Nguyen Vu, Viet-Anh Nguyen, Hoang-Dang Nguyen

Published: 2023, Last Modified: 22 May 2025, RIVF 2023, CC BY-SA 4.0

Abstract: Information Extraction (IE) is the process of transforming unstructured data into structured formats. IE is vital to many document intelligence applications and has high business potential because it can automate the manual extraction of required information from many kinds of documents. However, although a substantial amount of research has developed IE approaches based on deep learning models, none of it has applied these methods to Vietnamese documents. In this study, we provide a comprehensive review of the IE literature at the time of writing. We also propose a Vietnamese Visually Rich Document (VRD) benchmark consisting of input images, text contents, and corresponding IE labels. We verify the applicability of multiple IE models, such as MSAU, LayoutLM, and GNN, on this benchmark, show strong results for all of these methods, and evaluate the capabilities of each model on this dataset.