Vietnamese Receipt Information Extraction Using OCR and Deep Learning: A Hybrid Approach with Fuzzy C-Means and PhoBERTv2

Nguyen Hoang Ly, Do Tu Vy Nguyen, Tan Duy Le, Kha Tu Huynh

Published: 2025, Last Modified: 19 Mar 2026, Venue: IUKM (1) 2025, License: CC BY-SA 4.0
Abstract: Extracting data from receipts is vital for applications across many industries, and deep learning techniques offer significant potential for streamlining this process. Accurate extraction, however, depends on high-quality receipt images. This paper focuses on optimizing the image preprocessing stage and enhancing information retrieval methods for Vietnamese receipts. The study employs Fuzzy C-Means clustering to estimate the degree of image blur. The aim is to better understand the link between blur and character size, which is a key factor in improving the PAN algorithm for text detection and VietOCR for text recognition. Additionally, the paper explores information retrieval strategies based on PhoBERT v2, a state-of-the-art model for Vietnamese natural language processing, to efficiently extract and interpret textual data. After introducing the research problem and objectives, the paper reviews related literature and details the preprocessing techniques used to enhance image quality, including blur detection with Fuzzy C-Means clustering and background removal through thresholding techniques based on measures such as brightness and chromaticity distortion. The information retrieval methods, which leverage PhoBERT v2 for Vietnamese receipts, are then discussed, with emphasis on their impact on the accuracy and efficiency of receipt data extraction.
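To make the blur-estimation idea concrete, the following is a minimal sketch of Fuzzy C-Means clustering applied to per-image sharpness scores. It is not the authors' implementation: the abstract does not say which blur feature is clustered, so the one-dimensional scores below are made-up illustrative values (for example, they could stand in for a variance-of-Laplacian measure), and the function name `fuzzy_c_means` and its parameters are hypothetical.

```python
import numpy as np


def fuzzy_c_means(x, n_clusters=2, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Plain Fuzzy C-Means. x has shape (n_samples, n_features).

    Returns (centers, memberships), where memberships[i, j] is the degree
    to which sample i belongs to cluster j (each row sums to 1).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # Random initial memberships, normalised per sample.
    u = rng.random((x.shape[0], n_clusters))
    u /= u.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        um = u ** m  # fuzzified memberships
        # Cluster centers are membership-weighted means of the samples.
        centers = (um.T @ x) / um.sum(axis=0)[:, None]
        # Euclidean distance from every sample to every center.
        dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)  # avoid division by zero
        # Standard membership update: u_ij proportional to d_ij^(-2/(m-1)).
        inv = dist ** (-2.0 / (m - 1.0))
        new_u = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(new_u - u).max() < tol
        u = new_u
        if converged:
            break
    return centers, u


# Hypothetical sharpness scores: four blurry images, then four sharp ones.
scores = np.array([[20.0], [35.0], [28.0], [41.0],
                   [480.0], [510.0], [530.0], [495.0]])
centers, u = fuzzy_c_means(scores, n_clusters=2)
blurry_cluster = int(centers[:, 0].argmin())  # lower center = blurrier group
blur_degree = u[:, blurry_cluster]            # soft blur estimate per image
labels = u.argmax(axis=1)
```

Unlike a hard threshold, the membership in the low-sharpness cluster gives a graded blur estimate per image, which is the kind of soft measure that could then be related to character size.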