# RETVec: Resilient & Efficient Text Vectorizer


## Overview
RETVec is a next-generation text vectorizer that provides built-in adversarial resilience through robust word embeddings.

RETVec is trained to be resilient against character-level manipulations, including insertions, deletions, typos, homoglyphs, and more. The RETVec model is trained on top of a novel character embedding that can encode all UTF-8 characters and words. As a result, RETVec works out of the box on over 100 languages, with no lookup table or fixed vocabulary size. RETVec is also implemented as a layer, so it can be inserted into any TensorFlow/Keras model without a separate pre-processing step.
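To make the manipulations listed above concrete, here is a minimal pure-Python sketch that applies one insertion, one deletion, one adjacent-character swap (a common typo), and a homoglyph substitution to a word. It is illustrative only: the helper names and the small `HOMOGLYPHS` table are assumptions made for this example, not part of RETVec or its training pipeline.

```python
import random

# Small hand-picked homoglyph table (hypothetical, not exhaustive):
# Latin letters mapped to visually similar Cyrillic letters.
HOMOGLYPHS = {
    "a": "\u0430",  # Cyrillic small letter a
    "e": "\u0435",  # Cyrillic small letter ie
    "o": "\u043e",  # Cyrillic small letter o
    "i": "\u0456",  # Cyrillic small letter Byelorussian-Ukrainian i
}


def insert_char(word, rng):
    """Insert one random lowercase ASCII letter at a random position."""
    pos = rng.randrange(len(word) + 1)
    return word[:pos] + rng.choice("abcdefghijklmnopqrstuvwxyz") + word[pos:]


def delete_char(word, rng):
    """Delete one character; words shorter than two characters are returned unchanged."""
    if len(word) < 2:
        return word
    pos = rng.randrange(len(word))
    return word[:pos] + word[pos + 1:]


def swap_adjacent(word, rng):
    """Swap two adjacent characters, a common typing error."""
    if len(word) < 2:
        return word
    pos = rng.randrange(len(word) - 1)
    return word[:pos] + word[pos + 1] + word[pos] + word[pos + 2:]


def substitute_homoglyphs(word):
    """Replace every character that has an entry in HOMOGLYPHS."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in word)


rng = random.Random(0)  # fixed seed so repeated runs produce the same variants
word = "password"
variants = {
    "insertion": insert_char(word, rng),
    "deletion": delete_char(word, rng),
    "typo (swap)": swap_adjacent(word, rng),
    "homoglyph": substitute_homoglyphs(word),
}
for name, variant in variants.items():
    print(f"{name:12s} {variant!r}")
```

A vectorizer that relies on a fixed vocabulary would typically map most of these variants to an unknown token, which is the failure mode a resilient vectorizer aims to avoid.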

## Getting started

### Installation

[Upcoming] You can use pip to install the TensorFlow version of RETVec:

```bash
pip install retvec
```

You can also clone the RETVec repo and install the package directly using `setup.py`. RETVec has been tested on TensorFlow 2.6+ and Python 3.7+.

### Basic Usage

Here is a simple example of how to integrate RETVec into an LSTM model. The `RETVecTokenizer` layer should be the first layer of the model after the input layer, which must accept text (strings) as input. The layer splits the text into tokens and vectorizes them into RETVec embeddings. The embeddings can then be passed directly to the text model (e.g. an LSTM or Transformer) and used normally.

```python
import tensorflow as tf
from tensorflow.keras import layers

from retvec.tf.layers import RETVecTokenizer

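# NOTE: `retvec_model_path` is not defined in this snippet. It must point
# to a pre-trained RETVec model (see the Models section).

# The input layer accepts raw strings, one text per example.
inputs = layers.Input(shape=(1,), dtype=tf.string)

# RETVecTokenizer splits the text into tokens and embeds each token.
x = RETVecTokenizer(model=retvec_model_path, sequence_length=128)(inputs)

# ...
# Build the rest of your model normally
# ...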
```
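As one possible way to build the rest of the model, the sketch below wraps the snippet into a small text classifier with a bidirectional LSTM head. The function name `build_lstm_classifier` and the parameters `num_classes` and `lstm_units` are hypothetical choices for this example. It also assumes that `RETVecTokenizer` returns a 3-D tensor of shape `(batch, sequence_length, embedding_dim)`, which is what a Keras LSTM layer expects.

```python
def build_lstm_classifier(retvec_model_path, num_classes=2,
                          sequence_length=128, lstm_units=64):
    """Build a RETVec + BiLSTM classifier (illustrative sketch, not a reference model)."""
    # Imports are kept inside the function so this sketch can be defined
    # even where TensorFlow or RETVec is not installed.
    import tensorflow as tf
    from tensorflow.keras import layers
    from retvec.tf.layers import RETVecTokenizer

    # Raw strings in, one text per example.
    inputs = layers.Input(shape=(1,), dtype=tf.string)

    # Tokenize and embed. Assumed output shape:
    # (batch, sequence_length, embedding_dim).
    x = RETVecTokenizer(model=retvec_model_path,
                        sequence_length=sequence_length)(inputs)

    # Summarize the token sequence, then classify.
    x = layers.Bidirectional(layers.LSTM(lstm_units))(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    # Integer class labels are assumed, hence the sparse loss.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Calling `build_lstm_classifier(retvec_model_path)` returns a compiled Keras model that can be trained with `model.fit` on raw strings and integer labels. The layer sizes are arbitrary starting points rather than tuned values.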

### Notebooks

[Upcoming] Tutorial notebooks covering different RETVec use cases are located under the `notebooks` folder. The `hello_world.ipynb` notebook provides a simple example of how to integrate RETVec into a TensorFlow model for text classification. The `tpu_tutorial.ipynb` notebook demonstrates how to train RETVec-based models with TPU acceleration rather than on CPUs or GPUs.


### RETVec Pre-training

The RETVec model pre-training script is located at `training/train_tf_retvec_models.py`. Example usage:

```bash
python train_tf_retvec_models.py --train_config <train_config_path> --model_config <model_config_path> --output_dir <output_path>
```

Configurations for our models are under the `training/configs` folder.

### Benchmarks

Code for reproducing our benchmarks can be found in the `benchmarks` folder. Please see the README.md file in the `benchmarks` folder for more details on training and evaluating benchmarking models.

### Models

The RETVec models can be found in the `models` folder. `retvec-model-256-v0.1.0` is the 230k-parameter model with an embedding dimension of 256. Please see the tutorial notebooks for how to use the pre-trained RETVec models.

## Disclaimer
This is not an official [anonymized] product.
