Abstract: Standard Byte-Pair Encoding (BPE) tokenization compresses text by pairing a learned token vocabulary with a detailed merge list. Recent work has shown that this merge list exposes a potential attack surface for extracting information about a language model's training data. In this paper, we explore the downstream impact of BPE inference algorithms that do not rely on this merge list at all, and hence differ from the encoding process used during BPE training. To address this question, we investigate two broad classes of BPE inference schemes that differ from how BPE is applied during training: (a) targeted deviations from the merge list, including random merge orders and various corruptions of the merge list such as deletion and truncation, and (b) non-targeted BPE inference algorithms that do not depend on the merge list but instead compress the text either greedily or exactly. Extensive experiments across diverse language modeling tasks, including accuracy-based QA benchmarks, machine translation, and open-ended generation, reveal that while targeted deviations from the merge list cause significant degradation in language model performance, the non-targeted, merge-list-free inference algorithms have minimal impact on downstream performance, often much smaller than expected. These findings pave the way for simpler and potentially more privacy-preserving tokenization schemes that do not catastrophically compromise model performance.
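To make the distinction concrete, the sketch below contrasts standard merge-list-based BPE inference with one merge-list-free alternative, a greedy longest-match encoder that consults only the vocabulary. The toy merge list, vocabulary, and function names are illustrative assumptions for this sketch, not the paper's implementation.

# Minimal sketch (illustrative, not the paper's implementation) contrasting
# standard merge-list-based BPE inference with a merge-list-free greedy
# longest-match encoder over the same vocabulary. The toy merge list,
# vocabulary, and function names below are assumptions for illustration.

def merge_list_encode(word, merges):
    # Standard BPE inference: repeatedly apply the earliest-learned merge
    # that occurs in the current segmentation.
    rank = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word)
    while len(tokens) > 1:
        pairs = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
        best = min((p for p in pairs if p in rank), key=rank.get, default=None)
        if best is None:
            break
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

def greedy_longest_match_encode(word, vocab):
    # Merge-list-free inference: repeatedly take the longest vocabulary
    # entry that matches at the current position.
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # fall back to a single character
            i += 1
    return tokens

if __name__ == "__main__":
    merges = [("b", "c"), ("a", "b")]                # toy merge list
    vocab = {"a", "b", "c", "bc", "ab"}              # same tokens, no order
    print(merge_list_encode("abc", merges))          # ['a', 'bc']
    print(greedy_longest_match_encode("abc", vocab)) # ['ab', 'c']

As the toy example shows, the two encoders share a vocabulary yet can segment the same string differently; the paper's experiments measure how much such segmentation differences matter for downstream performance.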
Paper Type: Long
Research Area: Phonology, Morphology and Word Segmentation
Research Area Keywords: subword representations,
Contribution Types: Model analysis & interpretability
Languages Studied: English, German
Submission Number: 7387