Abstract: This paper describes an approach to estimating the unknown publication date for printed historical documents from their scanned page images, using Convolutional Neural Networks (CNN). The method primarily harnesses visual features from small image patches. Optionally, we augment the feature set with textual Optical Character Recognition (OCR) result features to improve accuracy, though at greater preprocessing cost. To be applied in various tasks, we develop both classification and regression models. As an example application, we show that OCR can be improved if we use estimated publication date to select the appropriate OCR model. Moreover, the resulting improvement in OCR accuracy is close to what could be achieved knowing the true publication date. We are not aware of previous work in estimating publication dates for printed historical documents with visual features.
0 Replies
Loading