Rule-based Adornment of Modern Historical Japanese Corpora using Accurate Universal Dependencies

Jerry Bonnell, Mitsunori Ogihara

Published: 2022, Last Modified: 10 Oct 2023Digit. Humanit. Q. 2022Readers: Everyone

Abstract: Historical materials are an indispensable resource for many scholarly workflows in the Digital Humanities. These workflows can benefit from the application of natural language processing (NLP) pipelines that offer support for tokenization, tagging, lemmatization, and dependency parsing. However, the application of these tools is not trivial as “off-the-shelf,” or pre-trained, tools are prone to error when given historical text as input and training data development can be expensive to carry out in terms of time and expertise needed. This paper introduces a rule-based workflow that can produce improved annotations encoded in Universal Dependencies (UD) targeted for modern historical Japanese corpora using only a pre-trained UD tool as a starting point. The proposed workflow reduces the amount of manual review time needed for training data development and brings improvements over pre-trained tools on a word segmentation task. Moreover, the workflow has the potential to pave the path toward adapting advanced NLP technologies to historical corpora under study.

0 Replies