Efficient Japanese Tokenization Based on Improved Pointwise Linear Classification

Anonymous

17 Apr 2022 (modified: 05 May 2023), ACL ARR 2022 April Blind Submission, Readers: Everyone

Abstract: This paper proposes multiple techniques to improve the runtime efficiency of Japanese tokenization based on the Pointwise Linear Classification (PLC) framework, which formulates the whole tokenization process as a sequence of linear classification problems. The techniques exploit the characteristics of both the PLC framework and the task definition itself. Specifically, we introduce (1) composing multiple classifications into array-based operations, (2) efficient feature lookup with memory-optimized automata, and (3) three orthogonal preprocessing methods that reduce the actual score calculation. Combining these techniques, our implementation runs 5.7 times faster than the existing tokenizer based on the same model, without any loss of tokenization accuracy.
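
To make the PLC formulation concrete, the following is a minimal, unoptimized sketch of pointwise boundary classification, not the implementation described in the paper. It assumes character n-gram features keyed by their offset from the candidate boundary; the names `boundary_score` and `tokenize`, the window size, the n-gram length, and the toy weights are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical model: (offset of n-gram start relative to the boundary, n-gram) -> weight.
Weights = Dict[Tuple[int, str], float]


def boundary_score(text: str, i: int, weights: Weights,
                   window: int = 2, max_n: int = 2) -> float:
    """Linear score for the boundary between text[i-1] and text[i].

    Sums the weights of all character n-grams (length 1..max_n) that lie
    entirely inside the window of `window` characters on each side.
    """
    lo = max(0, i - window)
    hi = min(len(text), i + window)
    score = 0.0
    for start in range(lo, hi):
        for n in range(1, max_n + 1):
            end = start + n
            if end > hi:
                break  # n-gram would leave the window
            score += weights.get((start - i, text[start:end]), 0.0)
    return score


def tokenize(text: str, weights: Weights) -> List[str]:
    """Classify every character boundary independently; split where score > 0."""
    if not text:
        return []
    tokens: List[str] = []
    begin = 0
    for i in range(1, len(text)):
        if boundary_score(text, i, weights) > 0.0:
            tokens.append(text[begin:i])
            begin = i
    tokens.append(text[begin:])
    return tokens


# Toy weights (assumed, not trained): favor a split after "都" and around "に".
toy_weights: Weights = {(-1, "都"): 1.0, (0, "に"): 0.5, (-1, "に"): 1.0}
tokens = tokenize("東京都に住む", toy_weights)
```

With these toy weights, `tokens` is `["東京都", "に", "住む"]`. Each boundary is scored independently of the others, and the score is a plain sum of feature weights; these are the two properties of the PLC framework that the abstract's array-based composition and automaton-based feature lookup are said to build on.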

Paper Type: long
