LaTeX Rainbow: Open Source Document Layout Semantic Annotation Framework

Published: 09 Oct 2023, Last Modified: 29 Oct 2023
Venue: NLP-OSS 2023
Keywords: Document Layout Analysis, Dataset Construction, Information Extraction
Abstract: Document Layout Analysis technology is advancing rapidly thanks to the large amount of high-quality labeled data. However, existing datasets built from document collections have shortcomings: (1) the hierarchical structure of papers is lost because labelling is done per page rather than per document; (2) content that is not part of the author's text, such as page headers, is not filtered out; (3) papers included in a dataset are unlikely to be up to date, i.e., they are not necessarily the latest version of a paper. We propose LaTeX Rainbow, an open-source annotation framework that can automatically annotate any LaTeX source code. This tool extends existing annotation methods by taking into account the properties of different datasets. It can produce token-level semantic structure annotations, preserve the paper's reading order, and extract the table of contents, i.e., information about the article's structure. LaTeX Rainbow enables anyone to extend their datasets with the latest documents. The framework also offers modifiable parsing rules and the potential to improve performance through parallelization. The project is open source on GitHub: https://github.com/InsightsNet/texannotate.
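To make the idea of extracting a table of contents from LaTeX source more concrete, here is a minimal sketch in Python. It is not LaTeX Rainbow's implementation or API: the function name `extract_toc`, the `LEVELS` mapping, and the regular expression are all assumptions made for illustration. It handles only `\section`, `\subsection`, and `\subsubsection` with brace-free titles, and it ignores comments and macros.

```python
import re

# Hypothetical mapping from sectioning command to heading depth.
# This is an illustrative assumption, not taken from LaTeX Rainbow.
LEVELS = {"section": 1, "subsection": 2, "subsubsection": 3}

# Matches \section{...}, \subsection{...}, \subsubsection{...},
# with an optional star; titles containing nested braces are not handled.
HEADING = re.compile(r"\\(section|subsection|subsubsection)\*?\{([^{}]*)\}")


def extract_toc(latex_source):
    """Return (depth, title) pairs in the order they appear in the source."""
    return [
        (LEVELS[match.group(1)], match.group(2).strip())
        for match in HEADING.finditer(latex_source)
    ]


sample = r"""
\section{Introduction}
Some text.
\subsection{Related Work}
\subsubsection*{Datasets}
\section{Conclusion}
"""

for depth, title in extract_toc(sample):
    # Indent each title by its depth to show the hierarchy.
    print("  " * (depth - 1) + title)
```

Because the headings are collected in source order, the result preserves reading order at the level of the whole document rather than per page. A real annotator would need an actual LaTeX parser to cope with macros, comments, and included files.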
Submission Number: 23