Learning Tokenization in Private Federated Learning with Sub-Word Model Sampling

Anonymous

16 Nov 2021 (modified: 05 May 2023) · ACL ARR 2021 November Blind Submission
Abstract: Federated learning with differential privacy, i.e., private federated learning (PFL), makes it possible to train models on private data distributed across users' devices without harming privacy. However, it is only known how to do this for models, such as neural networks, that have a fixed number of parameters and thus a fixed-dimensional gradient vector. Such models include neural-net language models, but not n-gram language models or, indeed, tokenizers, the topic of this work. Training a tokenizer normally requires access to the training data. An alternative is to train the tokenizer on publicly available data, but this, we show, degrades accuracy on a next-word prediction task by 10-20% across different datasets and models. We propose to take a tokenizer built on public data, use it to train a language model with PFL, and sample from that language model to build a new tokenizer. Retraining with the new tokenizer brings performance to within 2% of the oracle tokenizer, without expending additional privacy budget. Finally, we build a new federated pipeline to update the tokenizer during model training by modifying the affected model embeddings.
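The core idea of the abstract — draw text samples from a trained language model, then fit a fresh sub-word tokenizer on those samples instead of on the private data — can be illustrated with a toy sketch. Everything below is an assumption for illustration only: `sample_from_lm` is a hypothetical stand-in for sampling from the PFL-trained model, and the BPE learner is a minimal greedy pair-merge routine, not the paper's actual pipeline.

```python
import random
from collections import Counter

def sample_from_lm(n_sentences=100, seed=0):
    """Hypothetical stand-in: in the paper, text is sampled from a
    language model trained with PFL; here a fixed pool plays that role."""
    rng = random.Random(seed)
    pool = [
        "the model predicts the next word",
        "federated learning protects user data",
        "the tokenizer splits words into subwords",
    ]
    return [rng.choice(pool) for _ in range(n_sentences)]

def learn_bpe_merges(corpus, num_merges=20):
    """Learn BPE merges greedily from a corpus of sentences.

    Each word is a tuple of symbols; at every step the most frequent
    adjacent symbol pair is merged into a single symbol."""
    words = Counter()
    for line in corpus:
        for w in line.split():
            words[tuple(w)] += 1
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for sym, freq in words.items():
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # no adjacent pairs left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_words = Counter()
        for sym, freq in words.items():
            out, i = [], 0
            while i < len(sym):
                if i + 1 < len(sym) and (sym[i], sym[i + 1]) == best:
                    out.append(sym[i] + sym[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(sym[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

# Tokenizer training never touches private text, only model samples.
samples = sample_from_lm()
merges = learn_bpe_merges(samples, num_merges=20)
print(merges[:5])
```

The design point this sketch mirrors is that the new tokenizer is a function of the model's output distribution, not of any user's raw text, which is why the step consumes no additional privacy budget.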