RetVec: Resilient and Efficient Text Vectorizer

Elie Bursztein, Marina Zhang, Owen Vallis, Xinyu Jia, Alexey Kurakin

Published: 01 Jan 2023, Last Modified: 12 May 2023, CoRR 2023, Readers: Everyone

Abstract: This paper describes RetVec, a resilient multilingual embedding scheme designed for neural-based text processing, including small-text classification and large language models. RetVec combines a novel character encoding with an optional small model to embed words into a 256-dimensional vector space. These embeddings enable training competitive multilingual text models that are resilient to typos and adversarial attacks. In this paper, we evaluate RetVec and compare it to state-of-the-art tokenizers and word embeddings on common model architectures. These comparisons demonstrate that RetVec leads to competitive models that are significantly more resilient to text perturbations across a variety of common tasks. RetVec is available under the Apache 2 license at https://github.com/[anonymized].
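To make the idea of a typo-resilient character encoding concrete, the following is a minimal sketch, not the encoding defined in the paper. It assumes a hypothetical scheme in which each character's Unicode code point is written as a fixed-width block of bits, so that a single-character typo changes only a few positions of the word's vector. The names `MAX_CHARS`, `BITS_PER_CHAR`, `encode_word`, and `hamming` are illustrative assumptions, and the optional small model that maps a word into the 256-dimensional space is not shown.

```python
from typing import List

# Hypothetical parameters chosen for illustration only.
MAX_CHARS = 16      # characters kept per word; longer words are truncated
BITS_PER_CHAR = 24  # wide enough for any Unicode code point (max 0x10FFFF)


def encode_word(word: str) -> List[int]:
    """Encode a word as a flat bit list, one fixed-width block per character.

    Bits within each block are stored least-significant first.
    """
    bits: List[int] = []
    for ch in word[:MAX_CHARS]:
        code_point = ord(ch)
        bits.extend((code_point >> i) & 1 for i in range(BITS_PER_CHAR))
    # Zero-pad short words so every word has the same vector length.
    bits.extend([0] * (MAX_CHARS * BITS_PER_CHAR - len(bits)))
    return bits


def hamming(a: List[int], b: List[int]) -> int:
    """Count the positions at which two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    clean = encode_word("vector")
    typo = encode_word("vectar")
    print(len(clean))            # vector length: MAX_CHARS * BITS_PER_CHAR
    print(hamming(clean, typo))  # number of bits changed by the typo
```

Under these assumptions, a vocabulary-based tokenizer could map "vectar" to an unknown or unrelated token, whereas the two bit vectors here differ only within the block of the substituted character. That locality is the kind of property a downstream model can exploit for robustness to typos.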
