TabHD: Table Healer Dataset for OCR-Damaged Table Restoration with LLMs

ACL ARR 2025 February Submission2712 Authors

15 Feb 2025 (modified: 09 May 2025) | ACL ARR 2025 February Submission | Readers: Everyone | License: CC BY 4.0

Abstract: Tabular data encodes structural information in rows and columns and is used across many industries, including finance, government, and science. During Optical Character Recognition (OCR), however, table structures can become distorted and cell values can be extracted incorrectly, which degrades the performance of downstream tasks that rely on tabular data. Existing OCR post-processing techniques focus primarily on general text recovery, which limits their effectiveness at restoring complex table structures. To address this issue, we construct TabHD, a large-scale benchmark dataset for recovering tabular data damaged by OCR processing. Using this dataset, we systematically evaluate the table restoration performance of Large Language Models (LLMs). This study explores the potential of LLM-based table restoration in the OCR post-processing pipeline and suggests directions for developing more sophisticated models.

Paper Type: Long

Research Area: NLP Applications

Research Area Keywords: Large Language Models, Post-OCR, Graph, Tabular Data, Benchmark

Contribution Types: Data resources

Languages Studied: English

Submission Number: 2712
