# FlexiToken-Tokenization

FlexiToken is a flexible, script-aware tokenization and compression framework for multilingual and domain-adaptive NLP tasks. It enables custom byte-vocabulary extensions, script-specific tokenization, and configurable compression rates for efficient model training and evaluation.

---

## Features

- **Script-aware tokenization:** Add new tokens for specific scripts.
- **Configurable compression:** Control compression rates via binomial priors.
- **Flexible configuration:** Modify tokenization and training parameters via YAML config files.
- **Downstream evaluation:** Supports evaluation on a variety of NLP tasks and datasets.
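
As a rough illustration of how a binomial prior can steer compression (a sketch of the general technique, not this repository's code): if each of `n` bytes is treated as a token boundary with prior probability `p`, the boundary count `k` follows Binomial(n, p). The function names `binomial_log_prob` and `expected_bytes_per_token` are hypothetical and do not come from `src/`.

```python
import math


def binomial_log_prob(k: int, n: int, p: float) -> float:
    """Log-probability of k boundaries among n bytes under Binomial(n, p)."""
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and n")
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    # log of the binomial coefficient C(n, k), computed stably via lgamma
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_coeff + k * math.log(p) + (n - k) * math.log1p(-p)


def expected_bytes_per_token(p: float) -> float:
    """Approximate compression rate implied by a boundary prior p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return 1.0 / p
```

Under this simplified model, a prior of `p = 0.2` corresponds to roughly five bytes per token on average, and boundary counts far from `n * p` receive a lower log-probability.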

---

## Environment Setup

```bash
# Create and activate an isolated environment before installing dependencies
conda create -n fxt python=3.8
conda activate fxt
pip install -r requirements.txt
```

---

## Configuration

- Config files are located in the `configs/` directory.
- The main section to modify is the `boundaries` section:
  - `script_tokens`: New tokens to add for each script.
  - `prior_list`: Maps scripts to binomial priors for compression.
  - `temp`: Temperature of the Gumbel-sigmoid relaxation (lower values give sharper, more binary outputs).
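
To make the role of `temp` concrete, here is a minimal sketch of a Gumbel-sigmoid sample in plain Python. It assumes the common formulation in which logistic noise (the difference of two Gumbel samples) is added to a logit before a temperature-scaled sigmoid. The function names are hypothetical, and the actual implementation in `src/` may differ.

```python
import math
import random


def _stable_sigmoid(x: float) -> float:
    """Sigmoid that avoids overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gumbel_sigmoid(logit: float, temp: float, rng: random.Random) -> float:
    """Relaxed Bernoulli sample between 0 and 1 for a single logit."""
    if temp <= 0:
        raise ValueError("temp must be positive")
    # Clamp the uniform draw so the logs below stay finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    # log(u) - log(1 - u) is Logistic(0, 1) noise, which has the same
    # distribution as the difference of two Gumbel(0, 1) samples.
    noise = math.log(u) - math.log1p(-u)
    return _stable_sigmoid((logit + noise) / temp)


if __name__ == "__main__":
    for temp in (5.0, 1.0, 0.1):
        rng = random.Random(0)  # same noise for every temperature
        print(temp, gumbel_sigmoid(0.3, temp, rng))
```

With the noise held fixed, lowering `temp` pushes the output toward 0 or 1, while a higher `temp` keeps it closer to 0.5.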

---

## Training

To train a model with FlexiToken, run:

```bash
bash scripts/run_train.sh
```

- Training configurations are in `configs/train/`.
- Adjust the YAML files to set priors, tokenization strategies, and other hyperparameters.

---

## Downstream Evaluation

Evaluate trained models on downstream tasks:

```bash
# Example: Evaluate pretraining
bash scripts/eval_pretrain.sh

# For finetuning and task-specific evaluation, see scripts in scripts/finetune/
```

- Downstream configs are in `configs/finetune/`.
- Results are saved in the `results/` directory.

---

## Directory Structure

- `src/`: Source code for model, training, evaluation, and utilities.
- `configs/`: YAML configuration files for training and finetuning.
- `data/`: Tokenizer data and datasets.
- `results/`: Output and evaluation results.
- `scripts/`: Shell scripts for training, evaluation, and finetuning.

---

## Citation

If you use this codebase, please cite the corresponding paper (citation to be added).


