Multi-label Classification and Named Entity Recognition for Historical Documents

Ivan Gruber, Miroslav Hlavác, Petr Neduchal, Marek Hrúz

Published: 2024, Last Modified: 06 Jan 2026, Venue: DIS 2024, License: CC BY-SA 4.0

Abstract: In this paper, we present improvements to our processing pipeline for historical document digitization. The original pipeline is extended with two new functionalities: page labeling and named entity recognition. We treat page labeling as a multi-label classification task, for which we choose the Query2Label approach. Query2Label is tested on our internal NKVD dataset and reaches a mean average precision of 80.03% on the test set. For the named entity recognition task, we use pre-trained transformer-based models from DeepPavlov and benchmark them on two entity types: person name and location. The best model reaches promising results despite not being trained on our data at all.
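For readers unfamiliar with the metric reported above, the following is a minimal sketch of how mean average precision (mAP) is commonly computed for multi-label classification. It is not the authors' evaluation code. The abstract does not state which mAP variant was used, so the non-interpolated definition here is an assumption, as is scoring a label with no positive examples as 0. The function names and the toy data (4 pages, 3 labels) are hypothetical.

```python
from typing import List, Sequence


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Non-interpolated AP for one label.

    Samples are ranked by descending score; AP is the mean of precision@k
    taken at every rank k where a positive sample appears.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions: List[float] = []
    for rank, idx in enumerate(order, start=1):
        if y_true[idx] == 1:
            hits += 1
            precisions.append(hits / rank)
    # Assumption: a label with no positives contributes an AP of 0.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(
    y_true: Sequence[Sequence[int]], scores: Sequence[Sequence[float]]
) -> float:
    """mAP: the unweighted mean of per-label AP values."""
    n_labels = len(y_true[0])
    aps = [
        average_precision([row[j] for row in y_true], [row[j] for row in scores])
        for j in range(n_labels)
    ]
    return sum(aps) / n_labels


# Hypothetical toy data: rows are pages, columns are labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
scores = [[0.9, 0.2, 0.4], [0.3, 0.8, 0.1], [0.6, 0.4, 0.7], [0.1, 0.5, 0.9]]

print(round(mean_average_precision(y_true, scores), 4))  # 0.8889
```

In the toy data, the first label is ranked perfectly (AP = 1), while the other two each have one negative page ranked above a positive one (AP = 5/6), giving an mAP of 8/9.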

External IDs: dblp:conf/dis2/GruberHNH24