Faster Training of Word Embeddings

28 Sept 2020 (modified: 05 May 2023) · ICLR 2021 Conference Blind Submission
Keywords: multicore, performance, machine learning, word embeddings, word2vec, fasttext
Abstract: Word embeddings have gained increasing popularity in recent years due to the word2vec library and its extension fastText, which uses subword information. In this paper, we aim to improve the execution speed of fastText training on homogeneous multi- and manycore CPUs while maintaining accuracy. We present a novel open-source implementation that flexibly incorporates various algorithmic variants, including negative sample sharing, batched updates, and a byte-pair-encoding-based alternative for subword units. We build these variants on a fastText implementation that we carefully optimized for the architecture, memory hierarchy, and parallelism of current manycore CPUs. Our experiments on three languages demonstrate a 3-20x speed-up in training time at competitive semantic and syntactic accuracy.
One-sentence Summary: Design, hardware-oriented implementation, and evaluation of various algorithmic variants of fastText and word2vec
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Reviewed Version (pdf): https://openreview.net/references/pdf?id=oqtIL8DEZ
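
As a rough illustration of the negative-sample-sharing variant named in the abstract, here is a minimal NumPy sketch of skip-gram with negative sampling in which one set of negative samples is shared across a whole batch of (center, context) pairs, rather than drawn afresh per pair. This is a sketch under assumptions, not the paper's implementation (which targets manycore CPUs); all names and hyperparameters here (W_in, W_out, dim, lr, k) are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab, dim, lr, k = 10_000, 100, 0.05, 5      # sizes and learning rate are illustrative
    W_in  = rng.normal(0.0, 0.01, (vocab, dim))   # input (center-word) vectors
    W_out = np.zeros((vocab, dim))                # output (context-word) vectors

    def sgns_shared_negatives(centers, contexts):
        # Draw ONE set of k negative targets for the whole batch instead of
        # fresh negatives per pair; this cuts RNG cost and improves cache reuse.
        neg = rng.integers(0, vocab, size=k)
        labels = np.zeros(k + 1)
        labels[0] = 1.0                           # positive context first, negatives after
        for c, o in zip(centers, contexts):
            targets = np.concatenate(([o], neg))
            scores = W_out[targets] @ W_in[c]     # (k+1,) dot products
            grads = 1.0 / (1.0 + np.exp(-scores)) - labels   # sigmoid(score) - label
            g_in = grads @ W_out[targets]         # gradient w.r.t. the center vector
            W_out[targets] -= lr * np.outer(grads, W_in[c])
            W_in[c] -= lr * g_in

    # Hypothetical toy batch of word indices:
    sgns_shared_negatives(centers=[1, 2, 3], contexts=[4, 5, 6])

Because the negatives are shared, the same few rows of W_out are reused for every pair in the batch, which is what makes batched updates and cache-friendly memory access possible; consult the reviewed PDF above for the authors' actual design.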