Strategies in subword tokenization: humans vs. algorithms

Anonymous

16 Nov 2021 (modified: 05 May 2023) · ACL ARR 2021 November Blind Submission
Abstract: The output of subword tokenization can differ substantially depending on which algorithm is used. It is typically judged as more or less plausible according to how closely it corresponds to human intuition. Subword vocabulary overlap between manual and automatic segmentation is one indicator of plausibility, but it reveals little about how the segmentation process itself compares with human analysis. In this study, we propose a new method for analyzing subword segmentation strategies based on a spatial analysis of the distribution of subword lengths. Our experiments on English, Finnish and Turkish show that humans tend to balance creativity and consistency, while algorithms tend to be either strongly biased or inconsistent. To imitate humans better, algorithms need to produce subword segments of moderately uneven length, which can be achieved by combining complementary strategies.
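The abstract's core idea — comparing segmentation strategies via the distribution of subword lengths rather than vocabulary overlap — can be sketched minimally as follows. All segmentations and names here are hypothetical illustrations, not the paper's data or method: a "balanced" segmenter yields moderately uneven lengths, while a biased one yields uniformly short segments.

```python
from statistics import mean, pstdev

# Hypothetical segmentations of the same two words by different sources
# (illustrative only; not taken from the paper).
segmentations = {
    "human":      [["token", "ization"], ["un", "predict", "able"]],
    "bpe-like":   [["tok", "en", "ization"], ["unpredict", "able"]],
    "char-bias":  [["to", "ke", "ni", "za", "tion"],
                   ["un", "pr", "ed", "ic", "ta", "ble"]],
}

def length_profile(segmented_words):
    """Return mean subword length and its spread (population std)
    across all segments produced by one segmenter."""
    lengths = [len(seg) for word in segmented_words for seg in word]
    return mean(lengths), pstdev(lengths)

for name, segs in segmentations.items():
    mu, sigma = length_profile(segs)
    print(f"{name:10s} mean={mu:.2f} std={sigma:.2f}")
```

A low standard deviation with very short means (the `char-bias` row) signals a strong bias toward uniform tiny segments; a moderate spread is closer to the "moderately uneven length" the abstract identifies in human segmentation.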