Keywords: bert, model distillation, transfer learning, word embeddings, transformers, natural language processing
TL;DR: This study proposes a method to replace the word-embedding layer of a vocabulary-rigid transformer model with a vocabulary-free, CNN-based one, and finds that cosine-based loss metrics yield better results than Euclidean ones, offering a path to more flexible NLP models.
Abstract: This paper addresses the limitations of subword-based models in NLP by aligning the word-embedding layer of a vocabulary-rigid transformer model to a vocabulary-free one. To do so, a CNN is trained to mimic the word-embedding layer of a BERT model, using a sequence of byte tokens as input. The study compares cosine-based and Euclidean-based loss functions for training the student network and finds that cosine-based metrics yield better results. The research contributes techniques for re-training transformer embedding layers and provides insights into loss function selection. The findings have implications for developing flexible and robust NLP models.
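The core technique described above, training a byte-level CNN student to reproduce the outputs of BERT's word-embedding layer under a cosine-based loss, can be illustrated with a minimal sketch. The architecture, dimensions, and names below (ByteCNNEmbedder, cosine_distillation_loss, channel sizes, byte-sequence length) are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch, assuming a BERT-base embedding dimension of 768; the CNN
# architecture, kernel sizes, and byte-sequence length are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ByteCNNEmbedder(nn.Module):
    """Vocabulary-free student: maps a byte sequence to a single word embedding."""
    def __init__(self, embed_dim=768, byte_vocab=256, channels=256):
        super().__init__()
        self.byte_embed = nn.Embedding(byte_vocab, channels)
        self.conv = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.proj = nn.Linear(channels, embed_dim)

    def forward(self, byte_ids):          # byte_ids: (batch, num_bytes)
        x = self.byte_embed(byte_ids)      # (batch, num_bytes, channels)
        x = self.conv(x.transpose(1, 2))   # (batch, channels, num_bytes)
        x = x.max(dim=2).values            # max-pool over byte positions
        return self.proj(x)                # (batch, embed_dim)

def cosine_distillation_loss(student_emb, teacher_emb):
    """Cosine-based loss: 1 - cos(student, teacher), averaged over the batch."""
    return (1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1)).mean()

# Toy usage: align the student with (frozen) teacher embeddings.
student = ByteCNNEmbedder()
byte_ids = torch.randint(0, 256, (8, 16))   # 8 tokens, each as 16 byte IDs
teacher_emb = torch.randn(8, 768)            # stand-in for BERT embedding outputs
loss = cosine_distillation_loss(student(byte_ids), teacher_emb)
loss.backward()
```

A Euclidean-based alternative would simply swap the loss for a mean-squared-error term over the same pair of embeddings; the paper's comparison is between these two families of training objectives.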
Submission Type: Archival (to be published in the Journal of LatinX in AI (LXAI) Research)
Submission Number: 11