Japanese Mistakable Legal Term Correction using Infrequency-aware BERT Classifier

Published: 01 Jan 2019 · IEEE BigData 2019 · CC BY-SA 4.0
Abstract: We propose a method that assists legislative drafters in locating inappropriate legal terms in Japanese statutory sentences and suggests corrections. We focus on sets of mistakable legal terms whose usages are defined in legislation drafting rules. Our method predicts suitable legal terms using a classifier based on a BERT (Bidirectional Encoder Representations from Transformers) model. We apply three techniques in training the BERT classifier: preliminary domain adaptation, repetitive soft undersampling, and classifier unification. These techniques cope with two levels of infrequency: legal term-level infrequency, which causes class imbalance, and legal term set-level infrequency, which causes underfitting. Concretely, preliminary domain adaptation improves overall performance by providing prior knowledge of statutory sentences; repetitive soft undersampling improves performance on infrequent legal terms without sacrificing performance on frequent legal terms; and classifier unification improves performance on infrequent legal term sets by sharing common knowledge among legal term sets. Our experiments show that our classifier outperforms conventional classifiers based on Random Forest or a language model, and that all three training techniques contribute to the performance improvement.
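To give a rough sense of the repetitive soft undersampling idea described in the abstract, the sketch below redraws a class-balanced subset of the training data with a different random seed each epoch, so each epoch sees balanced classes while, across epochs, most examples of frequent legal terms are still used. This is a minimal illustration under our own assumptions (function names, data layout, and the per-epoch resampling policy are ours), not the authors' implementation.

```python
import random
from collections import defaultdict

def soft_undersample(examples, seed):
    """One undersampling round: cap every legal-term class at the size of the
    smallest class, drawing a fresh random subset of each frequent class."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    cap = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    sampled = []
    for group in by_label.values():
        sampled.extend(rng.sample(group, cap) if len(group) > cap else list(group))
    rng.shuffle(sampled)
    return sampled

# Toy data: one frequent legal term dominates an infrequent one (class imbalance).
data = [(f"sentence {i}", "frequent_term") for i in range(100)] + \
       [(f"sentence {i}", "infrequent_term") for i in range(10)]

# "Repetitive": resample with a new seed every epoch, so the classifier
# eventually sees most examples of the frequent term without any single
# epoch being dominated by it.
for epoch in range(3):
    subset = soft_undersample(data, seed=epoch)
    # ... fine-tune the BERT-based classifier on `subset` here ...
    print(f"epoch {epoch}: {len(subset)} balanced training examples")
```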