PDF-to-Tree: Parsing PDF Text Blocks into a Tree

ACL ARR 2024 June Submission 2237 Authors

15 Jun 2024 (modified: 17 Jul 2024), ACL ARR 2024 June Submission, CC BY 4.0

Abstract: In PDF documents, the reading order of text blocks is missing, which can hinder machine understanding of the document's content. Existing works try to extract one universal reading order for a PDF file. However, applications such as Retrieval Augmented Generation (RAG) require breaking long articles into sections and subsections for better indexing. For this reason, this paper introduces a new task and dataset, PDF-to-Tree, which organizes the text blocks of a PDF into a tree structure. Since a PDF may contain thousands of text blocks, far exceeding the number of words in a sentence, this paper proposes a transition-based parser that uses a greedy strategy to build the tree structure. Unlike parsers for plain text, our parser also uses multi-modal features to encode the parser state. Experiments show that our approach achieves an accuracy of 93.93%, outperforming the baseline methods by 6.72%.
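To make the idea of a greedy transition-based tree builder concrete, here is a minimal sketch. It is not the paper's parser: the action names `ATTACH` and `POP`, the `Node` class, the integer `level` feature, and `toy_policy` are all hypothetical stand-ins chosen for illustration. In particular, `toy_policy` is a hand-written rule standing in for whatever learned, multi-modal scorer the paper actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List

ATTACH, POP = "ATTACH", "POP"  # hypothetical action names, not from the paper


@dataclass
class Node:
    text: str
    level: int  # hypothetical feature: 0 = root, 1 = section, 2 = subsection, 99 = body text
    children: List["Node"] = field(default_factory=list)


def toy_policy(stack_top: Node, block: Node) -> str:
    """Stand-in for a learned classifier over the parser state (assumption)."""
    return ATTACH if stack_top.level < block.level else POP


def parse(blocks: List[Node], policy: Callable[[Node, Node], str] = toy_policy) -> Node:
    """Greedily build a tree from text blocks given in document order."""
    root = Node("ROOT", 0)
    stack = [root]  # path from the root to the most recently attached block
    for block in blocks:
        # Commit to one action at a time; the root always accepts a child.
        while len(stack) > 1 and policy(stack[-1], block) == POP:
            stack.pop()
        stack[-1].children.append(block)
        stack.append(block)
    return root


blocks = [
    Node("1 Introduction", 1),
    Node("First paragraph", 99),
    Node("Second paragraph", 99),
    Node("1.1 Background", 2),
    Node("Background paragraph", 99),
    Node("2 Method", 1),
]
root = parse(blocks)
print([child.text for child in root.children])
```

In this sketch each block is pushed once and popped at most once, so the number of actions grows linearly with the number of blocks, which is one reason a greedy strategy is attractive when a document has thousands of blocks.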

Paper Type: Long

Research Area: Information Extraction

Research Area Keywords: Information Extraction

Contribution Types: NLP engineering experiment

Languages Studied: en

Submission Number: 2237
