Abstract: Tabular data encodes structural information in rows and columns and is used across many sectors, including finance, government, and science. During Optical Character Recognition (OCR), however, table structures can become distorted and cell values can be extracted incorrectly, which degrades the performance of downstream tasks that rely on tabular data. Existing OCR post-processing techniques focus primarily on general text recovery, which limits their effectiveness in restoring complex table structures. To address this issue, we construct TabHD, a large-scale benchmark dataset for recovering tabular data damaged by OCR processing. Using this dataset, we systematically evaluate the table restoration performance of Large Language Models (LLMs). This study explores the potential of LLM-based table restoration in the OCR post-processing pipeline and suggests directions for developing more sophisticated models.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Large Language Models, Post-OCR, Graph, Tabular Data, Benchmark
Contribution Types: Data resources
Languages Studied: English
Submission Number: 2712