PDF-to-Tree: Parsing PDF Text Blocks into a Tree

Anonymous

16 Feb 2024 | ACL ARR 2024 February Blind Submission | Readers: Everyone

Abstract: In PDF documents, the reading order of text blocks is missing, which can hinder machine understanding of the document's content. Existing work tries to extract a single universal reading order for a PDF file. However, applications such as Retrieval Augmented Generation (RAG) require breaking long articles into sections and subsections for better indexing. For this reason, this paper introduces a new task and dataset, PDF-to-Tree, which organizes the text blocks of a PDF into a tree structure. Since a PDF may contain thousands of text blocks, far more than the number of words in a sentence, this paper proposes a transition-based parser that uses a greedy strategy to build the tree structure. In contrast to parsers for plain text, the parser also uses multi-modal features to encode the parser state. Experiments show that the approach achieves an accuracy of 93.93%, outperforming baseline methods by 12.22%.
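
To make the idea of greedy, transition-based tree construction concrete, here is a minimal Python sketch. It is not the paper's parser: the two actions (`attach`, `pop`), the `Node` class, the `font_size` feature, and the rule-based `toy_policy` are all hypothetical stand-ins chosen for illustration, and the sketch assumes the blocks arrive in a usable sequence. In a real system the policy would presumably be a learned classifier over multi-modal features of the parser state.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    text: str
    font_size: float  # hypothetical layout feature, not taken from the paper
    children: List["Node"] = field(default_factory=list)


def toy_policy(stack: List[Node], block: Node) -> str:
    """Stand-in for a learned classifier: returns 'attach' or 'pop'."""
    top = stack[-1]
    # Attach under the top of the stack if it is the root or looks like
    # a higher-level block (larger font); otherwise move up one level.
    if len(stack) == 1 or top.font_size > block.font_size:
        return "attach"
    return "pop"


def greedy_parse(blocks: List[Node],
                 policy: Callable[[List[Node], Node], str] = toy_policy) -> Node:
    """Build a tree with one greedy decision sequence per block."""
    root = Node("ROOT", float("inf"))
    stack = [root]  # path from the root to the most recently attached block
    for block in blocks:
        # Pop until the policy accepts the stack top as parent (never pop ROOT).
        while len(stack) > 1 and policy(stack, block) == "pop":
            stack.pop()
        stack[-1].children.append(block)
        stack.append(block)
    return root


blocks = [
    Node("1 Introduction", 14), Node("paragraph a", 10), Node("paragraph b", 10),
    Node("1.1 Background", 12), Node("paragraph c", 10), Node("2 Method", 14),
]
tree = greedy_parse(blocks)
```

Each block is pushed once and popped at most once, so the number of policy calls grows linearly with the number of blocks, which is the kind of property that makes a greedy strategy attractive when a document has thousands of blocks.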

Paper Type: long

Research Area: Information Extraction

Contribution Types: NLP engineering experiment

Languages Studied: EN
