Abstract: Multilingual automatic speech recognition (ASR) requires tokenization
that efficiently covers many writing systems. Byte-level
BPE (BBPE) using UTF-8 is widely adopted for its language-agnostic
design and full Unicode coverage, but its variable-length encoding
inflates token sequences for non-Latin scripts, such as Chinese,
Japanese, and Korean (CJK). Longer sequences increase computational
load and memory use. We propose BBPE16, a UTF-16-based
BBPE tokenizer that represents most modern scripts with a uniform
2-byte code unit. BBPE16 preserves BBPE’s language-agnostic
properties while substantially improving cross-lingual token sharing.
Across monolingual, bilingual, and trilingual ASR, and in a
multilingual continual-learning setup, BBPE16 attains comparable
or better accuracy; for Chinese, it reduces token counts by up to
10.4% and the number of decoding iterations by up to 10.3%. These reductions
speed up fine-tuning and inference and decrease memory
usage, making BBPE16 a practical tokenization choice for multilingual
ASR.
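
The encoding-length difference behind this proposal can be illustrated with a minimal sketch. It only counts the raw bytes that a byte-level tokenizer would start from, before any BPE merges, so it does not reproduce the token counts reported above. The helper name `byte_lengths` and the sample strings are illustrative, and the choice of little-endian UTF-16 without a byte-order mark is an assumption, not a detail taken from BBPE16.

```python
# Compare the raw byte lengths of text under UTF-8 and UTF-16.
# Assumption: UTF-16 little-endian without a BOM; the byte order actually
# used by BBPE16 is not stated in the abstract.

def byte_lengths(text: str) -> tuple[int, int]:
    """Return (utf8_bytes, utf16_bytes) for the given text."""
    utf8_len = len(text.encode("utf-8"))
    utf16_len = len(text.encode("utf-16-le"))  # "-le" avoids the 2-byte BOM
    return utf8_len, utf16_len


# Illustrative samples only; not data from the paper.
samples = {
    "English": "hello",
    "Chinese": "你好",
    "Japanese": "こんにちは",
    "Korean": "안녕",
}

for language, text in samples.items():
    utf8_len, utf16_len = byte_lengths(text)
    print(f"{language}: {len(text)} chars, "
          f"UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

In this sketch each CJK character takes three bytes in UTF-8 but one 2-byte code unit in UTF-16, while ASCII text goes the other way (one byte versus two). Characters outside the Basic Multilingual Plane, such as most emoji, take four bytes in both encodings, which is consistent with the claim covering most, not all, modern scripts.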