Keywords: word embedding, natural language understanding, weight sharing, contextual embedding
Abstract: The performance of modern neural language models relies heavily on the diversity of their vocabularies. Unfortunately, as language models cover larger vocabularies, the embedding parameters of models such as multilingual language models come to occupy more than half of all learnable parameters. To address this problem, we devise a novel embedding structure that lightens the network without considerable performance degradation. To reconstruct $N$ embedding vectors, we initialize $k$ bundles of $M (\ll N)$ sub-embeddings and combine them via a Cartesian product. Furthermore, we assign the $k$ sub-embeddings using the contextual relationships between tokens obtained from pretrained language models. We apply our $k$-sub-embedding structure to masked language models and evaluate the proposed structure on downstream tasks. Our experimental results show that sub-embeddings compressed by more than 99.9% perform comparably to the original embedding structure on the GLUE and XNLI benchmarks.
One-sentence Summary: We reconstruct embedding vectors from shared sub-embedding vectors to lighten language models.
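The abstract describes rebuilding a vocabulary of $N$ embeddings from $k$ small tables of $M (\ll N)$ sub-embeddings whose index tuples form a Cartesian product. The sketch below illustrates one plausible reading of that idea; the `CartesianSubEmbedding` class, the mixed-radix code assignment, and the summation used to combine sub-embeddings are assumptions for illustration (the paper assigns codes from contextual token relationships and does not specify the combination operator here).

```python
import torch
import torch.nn as nn


class CartesianSubEmbedding(nn.Module):
    """Reconstruct N token embeddings from k small sub-embedding tables.

    Each token id is mapped to a k-tuple of codes drawn from the Cartesian
    product {0..M-1}^k, so M**k >= N tokens can be represented with only
    k*M sub-embedding rows instead of N full rows.
    """

    def __init__(self, vocab_size: int, num_tables: int, codes_per_table: int, dim: int):
        super().__init__()
        assert codes_per_table ** num_tables >= vocab_size, "M**k must cover the vocabulary"
        self.sub_tables = nn.ModuleList(
            nn.Embedding(codes_per_table, dim) for _ in range(num_tables)
        )
        # Hypothetical code assignment: enumerate Cartesian-product tuples in
        # mixed-radix order. The paper instead assigns codes from contextual
        # token relationships in a pretrained language model.
        codes = torch.empty(vocab_size, num_tables, dtype=torch.long)
        ids = torch.arange(vocab_size)
        for t in range(num_tables):
            codes[:, t] = ids % codes_per_table
            ids = ids // codes_per_table
        self.register_buffer("codes", codes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Sum the k sub-embeddings selected for each token (summation is an
        # assumed combination operator; concatenation is another option).
        codes = self.codes[token_ids]  # (..., k)
        return sum(table(codes[..., t]) for t, table in enumerate(self.sub_tables))


# Example: a 30k-token vocabulary covered by k=2 tables of M=256 codes (256**2 >= 30k),
# i.e. 2*256*768 sub-embedding parameters instead of 30000*768 full embedding parameters.
emb = CartesianSubEmbedding(vocab_size=30000, num_tables=2, codes_per_table=256, dim=768)
vectors = emb(torch.tensor([[1, 5, 42]]))  # shape (1, 3, 768)
```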
Supplementary Material: zip