Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?

Published: 03 Jul 2024, Last Modified: 20 Jul 2024 · ICML 2024 FM-Wild Workshop Poster · CC BY 4.0
Keywords: distribution inference, tokenization
Abstract: The pretraining data of today's strongest language models remains opaque, even when their parameters are open-sourced. In particular, little is known about the proportions of different domains, languages, or code represented in the data. While a long line of membership inference attacks aims to identify training examples at the instance level, these attacks do not extend easily to _global_ statistics about the corpus. In this work, we tackle a task that we call _data mixture inference_, which aims to uncover the distributional make-up of training data. We introduce a novel attack based on a previously overlooked source of information — byte-pair encoding (BPE) tokenizers, used by the vast majority of modern language models. Our key insight is that the ordered list of merge rules learned by a BPE tokenizer naturally reveals information about the token frequencies in its training data: the first merge is the most common byte pair, the second is the most common pair after applying the first merge, and so on. Given a tokenizer's merge list along with data samples for each category of interest (e.g., different natural languages), we formulate a linear program that solves for the relative proportion of each category in the tokenizer's training set. Importantly, to the extent that tokenizer training data is representative of the pretraining data, we indirectly learn about the pretraining data. In controlled experiments, we show that our attack recovers mixture ratios with high precision for tokenizers trained on known mixtures of natural languages, programming languages, and data sources. We then apply our approach to off-the-shelf tokenizers released with recent LMs. We confirm much publicly disclosed information about these models, and also make several new inferences: `Gpt-4o` is much more multilingual than its predecessors, training on 10× more non-English data than `Gpt-3.5`; `Gpt-3.5` and `Claude` are trained predominantly on code; many recent models (or at least their tokenizers) are trained on 7-23% English books. We hope our work sheds light on current design practices for pretraining data, and inspires continued research into data mixture inference for LMs.
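To make the linear-program idea in the abstract concrete, here is a minimal sketch (my own illustration, not the authors' released code) under the following assumptions: for each merge step t of the target tokenizer, we have already measured, on sample data from each category c, the frequency of the pair the tokenizer actually merged and the frequencies of K competing pairs. For the true mixture weights, the merged pair should be at least as frequent as every competitor, which yields linear inequalities; slack variables absorb noise, and we minimize total slack. The function name `infer_mixture` and the array layout are hypothetical.

```python
# Hedged sketch of a data-mixture-inference LP, assuming precomputed
# per-category pair frequencies at each merge step of the target tokenizer.
import numpy as np
from scipy.optimize import linprog

def infer_mixture(chosen_counts, competitor_counts):
    """
    chosen_counts:     shape (T, C) -- frequency of the pair actually merged
                       at step t, measured on sample data from category c.
    competitor_counts: shape (T, K, C) -- frequencies of K competing pairs
                       at step t, per category.
    Returns estimated mixture weights alpha (length C, summing to 1).

    Constraints (one per step t and competitor k), with slack s_tk >= 0:
        sum_c alpha_c * (chosen[t, c] - comp[t, k, c]) >= -s_tk
    Objective: minimize the total slack.
    """
    T, K, C = competitor_counts.shape
    n_slack = T * K

    # Decision variables: [alpha_1..alpha_C, s_1..s_{T*K}]; cost = sum of slacks.
    cost = np.concatenate([np.zeros(C), np.ones(n_slack)])

    # Inequalities in linprog form A_ub @ x <= b_ub:
    #   (comp - chosen) . alpha - s <= 0  for every (t, k).
    diff = (competitor_counts - chosen_counts[:, None, :]).reshape(n_slack, C)
    A_ub = np.hstack([diff, -np.eye(n_slack)])
    b_ub = np.zeros(n_slack)

    # Equality constraint: mixture weights sum to 1.
    A_eq = np.concatenate([np.ones(C), np.zeros(n_slack)])[None, :]
    b_eq = np.array([1.0])

    bounds = [(0, 1)] * C + [(0, None)] * n_slack
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[:C]
```

In this toy formulation, each category's pair counts act as the LP coefficients, so the recovered alpha is the mixture under which the tokenizer's observed merge order is most nearly frequency-optimal; the paper's actual attack includes further refinements beyond this sketch.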
Submission Number: 105