Elementwise Language Representation

TMLR Paper905 Authors

28 Feb 2023 (modified: 17 Sept 2024) · Rejected by TMLR · CC BY 4.0
Abstract: We propose a new language representation method that generalizes all types of tokenization into a unified framework called elementwise language representation. This method represents each token using $\mathcal{N}$ low-dimensional byte embeddings, which are concatenated into a single vector. Using this framework, models can process text regardless of the tokenization applied. Most notably, by matching the number of attention heads in a Transformer architecture with $\mathcal{N}$, we can reduce its self-attention complexity in proportion to the model size. This technique requires no architectural modifications to the backbone Transformer and no additional overhead. Through experiments, we demonstrate that existing Transformer architectures trained within the proposed framework improve in terms of efficiency, robustness, and inference speed. These observations suggest the potential for an optimal pre-training objective built upon the elementwise language representation, guiding future work toward refining this approach.
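The abstract describes tokens as concatenations of $\mathcal{N}$ low-dimensional byte embeddings, with $\mathcal{N}$ matched to the number of attention heads. Below is a minimal, hypothetical PyTorch sketch of such a token embedding; the class name, dimensions (N = 8 bytes per token, 64-dimensional byte embeddings), and padding convention are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical sketch (not the authors' code): represent each token by the
# UTF-8 bytes of its surface form, embed each byte into a low-dimensional
# space, and concatenate N byte embeddings into a single token vector whose
# width matches the Transformer's hidden size (d_model = N * d_byte).

class ElementwiseTokenEmbedding(nn.Module):
    def __init__(self, n_bytes_per_token: int = 8, d_byte: int = 64):
        super().__init__()
        self.n = n_bytes_per_token              # N byte slots per token
        self.d_byte = d_byte                    # low-dimensional byte embedding size
        # 256 byte values plus one padding id for tokens shorter than N bytes
        self.byte_emb = nn.Embedding(257, d_byte, padding_idx=256)

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        # byte_ids: (batch, seq_len, N) integer byte values per token
        e = self.byte_emb(byte_ids)             # (batch, seq_len, N, d_byte)
        # Concatenate the N byte embeddings into one token vector.
        return e.flatten(-2)                    # (batch, seq_len, N * d_byte)

def token_to_byte_ids(token: str, n: int = 8) -> list[int]:
    """Truncate or pad a token's UTF-8 bytes to exactly n ids."""
    b = list(token.encode("utf-8"))[:n]
    return b + [256] * (n - len(b))

# Usage: feed the resulting (batch, seq_len, N * d_byte) tensor to a standard
# Transformer whose number of attention heads equals N, so that each head
# attends over one d_byte-wide slice of the token vector.
tokens = [["Hello", "world"]]
ids = torch.tensor([[token_to_byte_ids(t) for t in s] for s in tokens])
emb = ElementwiseTokenEmbedding()(ids)          # shape: (1, 2, 8 * 64)
```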
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Marcus_Rohrbach1
Submission Number: 905