Analyzing Cognitive Plausibility of Subword Tokenization

Published: 07 Oct 2023, Last Modified: 01 Dec 2023, EMNLP 2023 Main
Submission Type: Regular Short Paper
Submission Track: Linguistic Theories, Cognitive Modeling, and Psycholinguistics
Submission Track 2: Phonology, Morphology, and Word Segmentation
Keywords: subword tokenization, subword segmentation, cognitive signals, cognitive plausibility, lexical decision, vocabulary size, morphological segmentation
TL;DR: We evaluate subword tokenization algorithms using cognitive signals.
Abstract: Subword tokenization has become the de facto standard for tokenization, although comparative evaluations of subword vocabulary quality across languages are scarce. Existing evaluation studies focus on the effect of a tokenization algorithm on performance in downstream tasks, or on engineering criteria such as the compression rate. We present a new evaluation paradigm that focuses on the cognitive plausibility of subword tokenization. We analyze the correlation between the tokenizer output and the reading time and accuracy of human responses on a lexical decision task. We compare three tokenization algorithms across several languages and vocabulary sizes. Our results indicate that the Unigram algorithm yields less cognitively plausible tokenization behavior and worse coverage of derivational morphemes, in contrast with prior work.
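The sketch below illustrates the kind of evaluation the abstract describes: train subword tokenizers at several vocabulary sizes and correlate a simple segmentation statistic (token count per word) with human lexical-decision responses. It is a minimal illustration, not the authors' exact setup; the corpus path, the CSV column names (word, rt, acc), and the use of Hugging Face tokenizers plus SciPy are assumptions made for the example.

```python
# Minimal sketch (assumptions: a plain-text corpus at corpus.txt and a CSV
# lexical_decision.csv with columns word, rt, acc; not the paper's exact data).
import csv

from scipy.stats import spearmanr
from tokenizers import Tokenizer
from tokenizers.models import BPE, Unigram
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer, UnigramTrainer


def train_tokenizer(model, trainer, corpus="corpus.txt"):
    """Train a subword tokenizer on a raw-text corpus."""
    tok = Tokenizer(model)
    tok.pre_tokenizer = Whitespace()
    tok.train(files=[corpus], trainer=trainer)
    return tok


def correlate_with_behavior(tok, stimuli):
    """Spearman correlation of token count per word with RT and accuracy."""
    n_tokens = [len(tok.encode(word).tokens) for word, _, _ in stimuli]
    rts = [rt for _, rt, _ in stimuli]
    accs = [acc for _, _, acc in stimuli]
    return spearmanr(n_tokens, rts), spearmanr(n_tokens, accs)


# Lexical decision stimuli: (word, reaction time in ms, accuracy in [0, 1]).
with open("lexical_decision.csv", newline="") as f:
    stimuli = [(r["word"], float(r["rt"]), float(r["acc"]))
               for r in csv.DictReader(f)]

# Compare Unigram and BPE segmentations across vocabulary sizes.
for vocab_size in (16_000, 32_000, 50_000):
    unigram = train_tokenizer(Unigram(), UnigramTrainer(vocab_size=vocab_size))
    bpe = train_tokenizer(BPE(unk_token="[UNK]"), BpeTrainer(vocab_size=vocab_size))
    print(vocab_size, "Unigram:", correlate_with_behavior(unigram, stimuli))
    print(vocab_size, "BPE:", correlate_with_behavior(bpe, stimuli))
```

A stronger correlation between the number of subword splits and reading time (or error rate) would indicate, under this simplified proxy, that the tokenizer's segmentation behavior tracks human processing difficulty more closely.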
Submission Number: 2715