Metadata-Version: 2.4
Name: tokenizer_conversion
Version: 0.0.1
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Provides-Extra: vllm
Requires-Dist: vllm; extra == "vllm"

# Tokenizer Conversion

This repository contains the code to reproduce the experiments in the paper "Transducing Language models".

## Install dependencies

Install the package in editable mode along with its requirements:

```bash
pip install -e .
pip install -r requirements.txt
```

Then install the local copy of `transformers` shipped in `src/transformers`:

```bash
cd src/transformers
pip install -e .
```

## FSTs & Models
You can inspect the code for constructing the FSTs:

- The FST for transforming a language model into a character-level model is in `tokenizer_conversion/machines/hf_realpha.py`.
- The Penn Treebank FST is in `tokenizer_conversion/machines/ptb.py`.
- The DNA FST is in `tokenizer_conversion/machines/dna2aa.py`.

The pretrained DNA model is available from the following (anonymized) Google Drive: https://drive.google.com/file/d/1UozkUrKE5NKKxm9uTbA_-bzjWx4cu3SY/view?usp=drive_link
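As a conceptual illustration of the kind of mapping the DNA transducer encodes (this is not the repository's FST, and the codon table below is only a small subset of the standard genetic code), translating DNA to amino acids reads the input in three-symbol blocks:

```python
# Conceptual sketch of codon -> amino-acid translation. The table is a
# small subset of the standard genetic code, used purely for illustration.
CODON_TABLE = {
    "ATG": "M",  # Methionine (start codon)
    "GCC": "A",  # Alanine
    "TGG": "W",  # Tryptophan
    "AAA": "K",  # Lysine
    "TAA": "*",  # Stop codon
}

def translate(dna: str) -> str:
    """Translate a DNA string codon-by-codon using the table above."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        codon = dna[i:i + 3]
        aa = CODON_TABLE.get(codon, "?")  # '?' for codons outside the subset
        if aa == "*":  # stop at a stop codon
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGGCCTGGTAA"))  # -> "MAW"
```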

## Experiments

Here we provide instructions on how to reproduce the experiments.

### Benchmarking for Different Thresholds

```bash
mkdir results
```

#### Alphabetization
```bash
python src/tokenizer_conversion/benchmarking/prefix_probs_bytes_threshold.py --model gpt2-large --split test --paragraphs 10 --transducer hf_realpha --output results/gpt2_large_hf_realpha.pkl
```

#### Penn Treebank FST
```bash
python src/tokenizer_conversion/benchmarking/prefix_probs_bytes_threshold.py --model gpt2-large --split test --paragraphs 10 --transducer ptb --output results/gpt2_large_ptb.pkl
```

#### DNA to Amino Acids
```bash
python src/tokenizer_conversion/benchmarking/prefix_probs_bytes_threshold.py --model dna_gpt2 --split test --paragraphs 10 --transducer hf_dna2aa --output results/gpt2_dna_hf_dna2aa.pkl
```
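The benchmarking scripts above write their outputs as pickle files. Assuming they contain ordinary Python objects, the results can be loaded for inspection as follows (the file name and the contents shown here are placeholders, not the scripts' actual output format):

```python
import pickle

# Illustrative round-trip: the real structure of the benchmark pickles
# depends on the scripts above; this only demonstrates the loading idiom.
results = {"threshold": 0.1, "prefix_probs": [0.5, 0.25]}  # placeholder data

with open("example_results.pkl", "wb") as f:
    pickle.dump(results, f)

with open("example_results.pkl", "rb") as f:
    loaded = pickle.load(f)

print(loaded == results)  # True
```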

### Processing the Results

To obtain the Jensen-Shannon distances, run:
```bash
python src/tokenizer_conversion/benchmarking/jsd.py
```
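For reference, the Jensen-Shannon distance between two distributions p and q is the square root of the Jensen-Shannon divergence, (KL(p||m) + KL(q||m)) / 2 with m = (p + q) / 2. A minimal pure-Python sketch of the metric itself (the repository's `jsd.py` is the authoritative implementation for these experiments):

```python
import math

def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance with base-2 logs, so values lie in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # KL divergence; zero-probability terms in a contribute nothing
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return math.sqrt((kl(p, m) + kl(q, m)) / 2)

print(jensen_shannon_distance([0.5, 0.5], [0.5, 0.5]))  # 0.0 (identical)
print(jensen_shannon_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (disjoint support)
```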

### Benchmarking - Converting Universal States to Non-Universal States

```bash
mkdir results_drop_universals
```

#### Alphabetization
```bash
python src/tokenizer_conversion/benchmarking/prefix_probs_drop_universals.py --model gpt2-large --split test --length 256 --transducer hf_realpha --output results_drop_universals/drop_us_gpt2.pkl
```


#### Penn Treebank
```bash
python src/tokenizer_conversion/benchmarking/prefix_probs_drop_universals.py --model gpt2-large --split test --length 256 --transducer hf_ptb --output results_drop_universals/drop_us_gpt2_large.pkl
```
