Abstract: This paper proposes multiple techniques to improve the runtime efficiency of Japanese tokenization based on the Pointwise Linear Classification (PLC) framework, which formulates the whole tokenization process as a sequence of linear classification problems. Our techniques exploit the characteristics of the PLC framework and of the task definition itself. Specifically, we introduce (1) composing multiple classifications into array-based operations, (2) efficient feature lookup with memory-optimized automata, and (3) three orthogonal preprocessing methods that reduce the amount of actual score calculation. Combining these techniques, our implementation runs 5.7 times faster than the existing tokenizer based on the same model, without any loss of tokenization accuracy.
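To make the PLC formulation concrete, the following is a minimal, unoptimized sketch of tokenization as a sequence of independent linear classifications, one per character boundary. It is not the implementation described in this paper: the function names, the window size, the n-gram feature template, and the toy weights are all assumptions chosen for illustration, and none of the three speed-up techniques listed above is applied.

```python
from typing import Dict, List, Tuple

# Hypothetical feature weights: (character n-gram, offset of the n-gram's
# start relative to the boundary) -> weight. Real models are learned.
Weights = Dict[Tuple[str, int], float]


def boundary_score(text: str, i: int, weights: Weights,
                   window: int = 2, max_n: int = 2) -> float:
    """Linear score for the boundary between text[i-1] and text[i]."""
    lo = max(0, i - window)
    hi = min(len(text), i + window)
    score = 0.0
    for start in range(lo, hi):
        for n in range(1, max_n + 1):
            end = start + n
            if end > hi:
                break  # n-gram would leave the window
            # Unknown features contribute nothing to the score.
            score += weights.get((text[start:end], start - i), 0.0)
    return score


def tokenize(text: str, weights: Weights, bias: float = 0.0) -> List[str]:
    """Split text wherever the boundary score plus bias is positive."""
    if not text:
        return []
    tokens: List[str] = []
    begin = 0
    for i in range(1, len(text)):
        if boundary_score(text, i, weights) + bias > 0:
            tokens.append(text[begin:i])
            begin = i
    tokens.append(text[begin:])
    return tokens


# Toy weights (assumed, not learned): favor a boundary one character
# after the start of "京に" and of "に行".
toy_weights: Weights = {("京に", -1): 1.0, ("に行", -1): 1.0}
print(tokenize("東京に行く", toy_weights))  # ['東京', 'に', '行く']
```

Each boundary is scored independently of every other decision, so the per-boundary feature lookup and score summation in `boundary_score` dominate the cost in this naive version; that is the work the proposed techniques aim to reduce.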
Paper Type: long