Abstract: While NLP models significantly impact our lives, there are rising concerns about privacy invasion. Although federated learning enhances privacy, attackers may recover private training data by exploiting model parameters and gradients. Protecting against such embedding-based attacks therefore remains an open challenge. We propose Subword Embedding from Bytes (SEB), which encodes subwords into byte sequences with neural networks, making the underlying text harder to retrieve in attacks. Importantly, our method requires less memory, with a vocabulary of only $256$ bytes, while remaining efficient by keeping the input length unchanged. Thus, our solution outperforms conventional approaches by preserving privacy without sacrificing efficiency or accuracy. Our experiments show that SEB effectively prevents embedding-based attacks from recovering original sentences in federated learning. Meanwhile, we verify that SEB achieves comparable, and sometimes better, accuracy than standard subword embedding methods on machine translation, sentiment analysis, and language modeling, with lower time and space complexity.
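To make the core idea concrete, below is a minimal, hypothetical sketch of embedding a subword from its raw bytes: each subword is mapped to its UTF-8 byte sequence, each byte is looked up in a 256-entry embedding table, and a small network aggregates the byte vectors into one subword embedding. The aggregation network (a GRU here) and all names are illustrative assumptions; the paper's actual SEB architecture may differ.

```python
import torch
import torch.nn as nn

class ByteSubwordEmbedding(nn.Module):
    """Illustrative sketch: embed a subword from its raw UTF-8 bytes,
    using a 256-entry byte vocabulary instead of a large subword table.
    The GRU aggregator is an assumption, not the paper's exact design."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.byte_embed = nn.Embedding(256, embed_dim)   # vocabulary of only 256 bytes
        self.aggregator = nn.GRU(embed_dim, embed_dim, batch_first=True)

    def forward(self, subword: str) -> torch.Tensor:
        # Convert the subword into its byte sequence (values 0-255).
        byte_ids = torch.tensor(list(subword.encode("utf-8"))).unsqueeze(0)
        byte_vecs = self.byte_embed(byte_ids)             # (1, n_bytes, embed_dim)
        _, last_hidden = self.aggregator(byte_vecs)       # summarize bytes into one vector
        return last_hidden.squeeze(0).squeeze(0)          # (embed_dim,) subword embedding

if __name__ == "__main__":
    model = ByteSubwordEmbedding(embed_dim=64)
    vec = model("##ing")   # hypothetical subword token
    print(vec.shape)       # torch.Size([64])
```

Because only the 256-entry byte table and the aggregator weights are stored, the embedding memory no longer grows with the subword vocabulary, which is consistent with the memory savings claimed in the abstract.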
Paper Type: long
Research Area: Machine Learning for NLP
Contribution Types: Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English, German