Do We Really Need Subword Prefix For BERT Tokenizer? An Empirical Study

Anonymous

03 Sept 2022 (modified: 05 May 2023) · ACL ARR 2022 September Blind Submission · Readers: Everyone
Abstract: Subword tokenizers are widely adopted for pretrained language models (PLMs). Because they were originally designed for machine translation (MT), these tokenizers introduce a subword prefix to distinguish word-initial tokens from continuation tokens, which aids decoding. In this paper, we empirically study whether the subword prefix is necessary for pretrained language models on natural language understanding (NLU) tasks. Experimental results show that our prefix-free variant, BERT-SPFT, achieves comparable performance with 19% fewer embedding parameters. A further probing task also suggests that the ability to distinguish subword types is not related to model performance.
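For readers unfamiliar with the prefix in question, the minimal sketch below (not the paper's code, and not a reproduction of BERT-SPFT) illustrates the `##` continuation marker that BERT's WordPiece tokenizer attaches to non-initial subwords, and roughly how many vocabulary entries exist only as `##`-prefixed duplicates of unprefixed tokens. It assumes the Hugging Face `transformers` package and the public `bert-base-uncased` checkpoint.

```python
# Sketch: inspect the "##" subword prefix in BERT's WordPiece vocabulary.
# Assumes Hugging Face `transformers` and the public bert-base-uncased model;
# this is illustrative only and is not the paper's BERT-SPFT implementation.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# Continuation pieces carry the "##" prefix; word-initial pieces do not.
print(tok.tokenize("tokenization is unavoidable"))

# Rough estimate of the redundancy a prefix-free vocabulary could remove:
# "##" entries whose surface form (minus the prefix) also exists unprefixed.
vocab = tok.get_vocab()
prefixed = [t for t in vocab if t.startswith("##")]
duplicated = [t for t in prefixed if t[2:] in vocab]
print(f"{len(prefixed)} '##' entries, {len(duplicated)} also exist unprefixed")
```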
Paper Type: short
