EditTrans: Speedy Edit-based Detailed Transformation of Academic Documents into Markup

ACL ARR 2024 June Submission2353 Authors

15 Jun 2024 (modified: 02 Jul 2024), ACL ARR 2024 June Submission, CC BY 4.0
Abstract: Academic documents stored in PDF format can be transformed into plain-text structured markup languages to enhance accessibility. Markup languages allow for easier updates and customization, making academic content more adaptable to diverse uses, such as linguistic corpus compilation. Existing end-to-end decoder transformer models can transform screenshots of documents into markup language, and they are more flexible than encoder transformers based on Document Layout Analysis. However, decoder transformers have more parameters and run more slowly. Because they decode token by token from scratch, they spend many inference steps generating dense text that could be copied directly from the PDF file. To solve this problem, we introduce EditTrans, which identifies a queue of to-be-edited text from a PDF before it starts generating markup language. EditTrans contains a lightweight classifier that is fine-tuned from a Document Layout Analysis model on 162,127 pages of documents from arXiv. In our evaluations, EditTrans reduced the number of generation steps by 42.9% compared to end-to-end decoder transformer models.
Paper Type: Short
Research Area: Special Theme (conference specific)
Research Area Keywords: cross-modal information extraction, document-level extraction, resource-efficient NLP
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 2353
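
Illustrative sketch: the abstract describes copying dense text from the PDF and generating markup only for a queue of to-be-edited text. The Python below is a minimal toy illustration of that idea, not the authors' implementation. The names `Segment`, `needs_edit`, `plan_edits`, and `step_reduction` are hypothetical, the classifier output is hard-coded, and whitespace splitting stands in for a real subword tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str         # text taken from the PDF text layer
    needs_edit: bool  # hypothetical classifier decision: True means markup must be generated


def plan_edits(segments):
    """Split segments into those copied verbatim and a queue to be generated."""
    copied = [s for s in segments if not s.needs_edit]
    queue = [s for s in segments if s.needs_edit]
    return copied, queue


def count_tokens(text):
    # Assumption: whitespace tokens approximate decoder steps.
    return len(text.split())


def step_reduction(segments):
    """Fraction of decoding steps avoided by copying instead of generating."""
    total = sum(count_tokens(s.text) for s in segments)
    if total == 0:
        return 0.0
    copied, _ = plan_edits(segments)
    saved = sum(count_tokens(s.text) for s in copied)
    return saved / total


# Toy page: a heading and a formula need markup; the body text can be copied.
page = [
    Segment("1 Introduction", needs_edit=True),
    Segment("Dense body text that can be copied directly from the PDF", needs_edit=False),
    Segment("x ^ 2 + y ^ 2", needs_edit=True),
]

copied, queue = plan_edits(page)
print(len(copied), len(queue), step_reduction(page))
```

In this toy setting, the saving depends only on how much of the page is labeled as copyable. A real system would also pay for running the classifier and for any tokens emitted to mark copied spans, so the figure computed here is not comparable to the 42.9% reported in the abstract.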