PDF-to-Tree: Parsing PDF Text Blocks into a Tree

ACL ARR 2024 June Submission 2237 Authors

15 Jun 2024 (modified: 17 Jul 2024), ACL ARR 2024 June Submission, CC BY 4.0
Abstract: In PDF documents, the reading order of text blocks is missing, which can hinder machine understanding of the document's content. Existing works try to extract a single universal reading order for a PDF file. However, applications such as Retrieval Augmented Generation (RAG) require breaking long articles into sections and subsections for better indexing. For this reason, this paper introduces a new task and dataset, PDF-to-Tree, which organizes the text blocks of a PDF into a tree structure. Since a PDF may contain thousands of text blocks, far exceeding the number of words in a sentence, this paper proposes a transition-based parser that uses a greedy strategy to build the tree structure. Unlike parsers for plain text, our parser also uses multi-modal features to encode the parser state. Experiments show that our approach achieves an accuracy of 93.93%, outperforming baseline methods by 6.72%.
Paper Type: Long
Research Area: Information Extraction
Research Area Keywords: Information Extraction
Contribution Types: NLP engineering experiment
Languages Studied: en
Submission Number: 2237
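
Illustrative sketch (not the authors' implementation): the abstract describes a greedy transition-based parser but does not specify its transition system or features. The Python below shows one generic way such a parser could attach text blocks to a tree in a single pass. The actions ATTACH and POP, the Node class, the numeric level field, and rule_based_policy are all hypothetical stand-ins; in the paper, the decision at each step would presumably come from a learned model over multi-modal features of the parser state.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    text: str
    # Hypothetical depth hint: 0 = root, 1 = section, 2 = subsection,
    # 99 = body text. A real system would not be given this directly.
    level: int
    children: List["Node"] = field(default_factory=list)


def rule_based_policy(top: Node, nxt: Node) -> str:
    """Assumed stand-in for a learned classifier over the parser state."""
    return "ATTACH" if nxt.level > top.level else "POP"


def greedy_parse(
    blocks: List[Node],
    policy: Callable[[Node, Node], str] = rule_based_policy,
) -> Node:
    """Build a tree greedily: one action per step, no backtracking."""
    root = Node("ROOT", level=0)
    stack = [root]  # path from the root to the most recently attached block
    i = 0
    while i < len(blocks):
        top, nxt = stack[-1], blocks[i]
        action = policy(top, nxt)
        if action == "POP" and len(stack) > 1:
            # Close the current node and reconsider its parent.
            stack.pop()
        else:
            # Attach the next block under the stack top (the root never pops).
            top.children.append(nxt)
            stack.append(nxt)
            i += 1
    return root
```

Because every block is pushed once and popped at most once, the number of actions grows linearly with the number of blocks, which is the practical appeal of a greedy strategy when a document contains thousands of blocks.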