Abstract: One of the issues in extracting natural language sentences from PDF documents is the identification of non-textual elements in a sentence. In this paper, we report our preliminary results on the identification of in-line mathematical expressions. We first construct a manually annotated corpus and apply conditional random field (CRF) for the math-zone identification using both layout features, such as font types, and linguistic features, such as context n-grams, obtained from PDF documents. Although our method is naive and uses a small amount of annotated training data, our method achieved an 88.95% F-measure compared with 22.81% for existing math OCR software.
0 Replies
Loading