Abstract: Tokenization- and sub-tokenization-based models such as word2vec, BERT, and GPT-like
models are the state of the art in natural language processing. Typically, these approaches
have limitations with respect to their input representation: they fail to fully capture
orthographic similarities and morphological variations, especially in highly inflected and
low-resource languages. Additionally, the sub-tokenization required for these languages
increases the length of the input sequence, and the input sequence grows even longer for
purely byte- or character-based models. To mitigate this problem, we propose a method to
compute word vectors directly from character strings, integrating both semantic and
syntactic information.
We denote this transformer-based approach Rich Character Embeddings (RCE). Furthermore,
we propose a hybrid model that combines transformer and convolutional mechanisms. Both
vector representations can be used as drop-in replacements for dictionary- and
subtoken-based word embeddings in existing model architectures, and they have the potential
to improve performance for both large context-based language models like BERT and small
models like word2vec for low-resource and morphologically rich languages. We evaluate our
approach on various tasks, including SWAG, declension prediction for inflected languages,
and metaphor and chiasmus detection across multiple languages. Our experiments show that it
outperforms traditional token-based approaches on limited data, both in terms of
Odd-One-Out and TopK metrics and on application-based downstream tasks.
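The core idea of the abstract, producing a fixed-size word vector directly from a word's characters rather than from a token dictionary, can be illustrated with a minimal sketch. All names, dimensions, and the single self-attention step below are illustrative assumptions, not the actual RCE architecture, which the paper describes as a full transformer-based encoder:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class CharWordEncoder:
    """Toy character-level word encoder: embeds each byte of a word,
    mixes the characters with one self-attention step, then mean-pools
    into a single fixed-size vector. Weights are random (untrained)."""

    def __init__(self, dim=16, seed=0):
        rng = np.random.default_rng(seed)
        self.emb = rng.normal(scale=0.1, size=(256, dim))  # one embedding per byte value
        self.Wq = rng.normal(scale=0.1, size=(dim, dim))
        self.Wk = rng.normal(scale=0.1, size=(dim, dim))
        self.Wv = rng.normal(scale=0.1, size=(dim, dim))
        self.dim = dim

    def encode(self, word):
        # look up a vector for every byte of the UTF-8 encoded word
        x = self.emb[np.frombuffer(word.encode("utf-8"), dtype=np.uint8)]
        q, k, v = x @ self.Wq, x @ self.Wk, x @ self.Wv
        attn = softmax(q @ k.T / np.sqrt(self.dim))  # character-to-character attention
        return (attn @ v).mean(axis=0)               # pool to one vector per word

enc = CharWordEncoder()
v_base = enc.encode("Haus")
v_infl = enc.encode("Hauses")  # inflected form shares most characters with "Haus"
```

Because morphological variants like "Haus"/"Hauses" share most of their characters, a trained encoder of this shape can assign them related vectors without any sub-token vocabulary, which is the property the abstract targets for inflected and low-resource languages.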
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Adin_Ramirez_Rivera1
Submission Number: 7693