# AlienLM - Alien Language Model

AlienLM replaces the tokenizer of an existing language model with an encrypted "alien" tokenizer and then trains the model on it. The alien tokenizer is built from the original one using proxy embeddings.

## Project Structure

```
AlienLM/
├── alien_tokenizer/                 # Alien tokenizer-related code
│   ├── token-freq/                 # Token frequency analysis
│   │   ├── token_count.ipynb      # Calculate token frequencies in training data
│   │   └── result/                # Frequency analysis results
│   │       └── Meta-Llama-3-8B-Instruct/
│   │           └── pro_tok_dict.json
│   ├── token_init/                 # Tokenizer initialization
│   │   └── Meta-Llama-3-8B-Instruct/
│   │       ├── token_matching.py   # Proxy embedding-based token matching
│   │       ├── Efficient_Token_Matcher.py
│   │       ├── build_tokenizer.ipynb  # Alien tokenizer generation
│   │       └── matches-sim-and-diff.txt
│   └── tokenizers/                 # Generated tokenizer storage
│       ├── original/              # Original tokenizer
│       ├── alienlm/               # Final alien tokenizer
│       └── qwenv2_bucket_random_*/ # Tokenizers generated with various seeds
├── encryption-adaptive-train/       # Encryption adaptive training
│   ├── configs/                   # Training configuration files
│   │   └── full.yaml             # Full training configuration
│   └── scripts/                   # Execution scripts
│       └── train_full.sh         # Training execution script
├── axolotl/                        # Axolotl training framework (submodule)
│   ├── src/                       # Source code
│   ├── deepspeed_configs/         # DeepSpeed configurations
│   └── examples/                  # Example configuration files
├── build_translator.py             # Original-alien text converter
└── requirements.txt               # Python dependencies

```

## Environment Setup

### 1. Prerequisites
- Python 3.10+
- CUDA 11.7+

### 2. Install Dependencies

```bash
# Install base packages
pip install -r requirements.txt

# Install Axolotl from the local submodule so its version stays fixed
cd axolotl
pip install -e .
cd ..

# Install additional required packages
pip install faiss-gpu  # for token matching
```

### 3. Environment Variables

```bash
# Set HuggingFace cache directories
export HF_DATASETS_CACHE=/your/path/to/HF_DATASET
# Newer transformers releases deprecate TRANSFORMERS_CACHE in favor of HF_HOME
export TRANSFORMERS_CACHE=/your/path/to/MODELS

# Set model and data paths (modify as needed)
export WORKSPACE=/your/workspace/path
```

## Execution Pipeline

### 1. (Optional) Token Frequency Calculation

Count how often each token in the target tokenizer's vocabulary occurs in the training data.
This step is optional, but the frequencies may improve token matching.

```bash
cd alien_tokenizer/token-freq
jupyter notebook token_count.ipynb
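# Or convert the notebook to a Python script and run it
# (works only if the notebook contains no IPython magics):
# jupyter nbconvert --to script token_count.ipynb && python token_count.py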
```

This step:
- Loads the Magpie-Align dataset
- Calculates occurrence frequency for each token
- Saves results to `result/Meta-Llama-3-8B-Instruct/pro_tok_dict.json`
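
The counting itself can be sketched in a few lines. This is a minimal illustration, not the notebook's code: `count_token_frequencies` and `toy_tokenize` are hypothetical names, a whitespace splitter stands in for the Llama 3 tokenizer, and the actual layout of `pro_tok_dict.json` may differ.

```python
import json
from collections import Counter


def count_token_frequencies(texts, tokenize):
    """Count how often each token appears across an iterable of texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def toy_tokenize(text):
    # Assumption: whitespace splitting stands in for the target model's tokenizer.
    return text.lower().split()


corpus = [
    "the alien speaks",
    "the model listens to the alien",
]
freqs = count_token_frequencies(corpus, toy_tokenize)

# Serialize from most to least frequent; the real JSON schema is an assumption.
freq_json = json.dumps(dict(freqs.most_common()), indent=2)
print(freq_json)
```

With a Hugging Face tokenizer, `tokenize` would presumably return token ids or token strings from the real vocabulary rather than whitespace-separated words.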

### 2. Building Alien Language (Alien Tokenizer Initialization)

Generate the alien tokenizer using proxy embeddings.

```bash
cd alien_tokenizer/token_init/Meta-Llama-3-8B-Instruct

# Step 1: Execute token matching
python token_matching.py

# Step 2: Generate alien tokenizer
jupyter notebook build_tokenizer.ipynb
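# Or convert the notebook to a Python script and run it
# (works only if the notebook contains no IPython magics):
# jupyter nbconvert --to script build_tokenizer.ipynb && python build_tokenizer.py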
```

This step:
- Uses Qwen2.5-7B-Instruct as the proxy model
- Creates a proxy embedding for each token in the original tokenizer
- Finds an optimal matching that takes token frequencies into account
- Saves the new alien tokenizer to `alien_tokenizer/tokenizers/alienlm/`
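
To make the idea of frequency-aware matching concrete, here is a toy sketch of one possible scheme: tokens are visited from most to least frequent, and each unmatched token is paired with the most similar remaining token by cosine similarity. This is not the algorithm in `token_matching.py`; the pairing rule, the similarity objective, and the names `cosine` and `greedy_pairing` are all assumptions made for illustration, and the embeddings are hand-written 2-D vectors rather than proxy-model outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_pairing(emb, freq):
    """Build a one-to-one token mapping by pairing tokens greedily.

    Tokens are visited in descending frequency. Each unmatched token is
    paired with the most similar still-unmatched token, and the two are
    mapped to each other.
    """
    order = sorted(emb, key=lambda t: (-freq.get(t, 0), t))
    unmatched = set(emb)
    mapping = {}
    for tok in order:
        if tok not in unmatched:
            continue
        unmatched.remove(tok)
        if not unmatched:
            mapping[tok] = tok  # odd vocabulary size: last token maps to itself
            break
        # sorted() makes tie-breaking deterministic
        partner = max(sorted(unmatched), key=lambda c: cosine(emb[tok], emb[c]))
        unmatched.remove(partner)
        mapping[tok] = partner
        mapping[partner] = tok
    return mapping


# Toy embeddings and frequencies (illustrative values only)
emb = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0], "bus": [0.1, 0.9]}
freq = {"cat": 10, "car": 7, "dog": 5, "bus": 1}

mapping = greedy_pairing(emb, freq)
print(mapping)
```

Visiting frequent tokens first gives them first choice of partner, which is one way frequencies could influence the result. At real vocabulary sizes an approximate nearest-neighbor index such as FAISS would likely replace the brute-force `max`.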

### 3. Encryption Adaptation Training (EAT)

Train the model using the generated alien tokenizer.

```bash
cd encryption-adaptive-train/scripts

# Execute training
./train_full.sh
```

Training configuration (`configs/full.yaml`):
- Base model: meta-llama/Meta-Llama-3-8B-Instruct
- Tokenizer: initialized alien tokenizer
- Dataset: Magpie-Align/Magpie-Llama-3-Pro-300K-Filtered
- Training epochs: 2
- Sequence length: 2048
- Optimizer: paged_adamw_8bit

### 4. Inference with Translator

Use the translator to convert text between its original and alien forms when calling the model through an API.

```python
from build_translator import build_translator, load_tokenizer

# Load tokenizers
original_tokenizer, alien_tokenizer = load_tokenizer(
    model_id="meta-llama/Meta-Llama-3-8B-Instruct",
    alien_tokenizer_path="/path/to/alien_tokenizer/alienlm"
)

# Create translator
translator = build_translator(original_tokenizer, alien_tokenizer)

# Usage example
original_text = "Hello, world!"
encoded_text = translator.encode(original_text)  # original → alien
print(f"Encoded: {encoded_text}")

decoded_text = translator.decode(encoded_text)  # alien → original
print(f"Decoded: {decoded_text}")
```
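
Conceptually, such a translator can be thought of as encoding text to token ids with one vocabulary and decoding the same ids with the other. The sketch below illustrates that round trip with two tiny hand-made vocabularies. `ToyTranslator` is a hypothetical class, not the object returned by `build_translator`, and it splits on whitespace instead of using real subword tokenizers.

```python
class ToyTranslator:
    """Re-spell text by encoding with one vocabulary and decoding with another."""

    def __init__(self, original_vocab, alien_vocab):
        # Assumption: both vocabularies map token string -> id over the same id set.
        self.orig_to_id = dict(original_vocab)
        self.alien_to_id = dict(alien_vocab)
        self.id_to_orig = {i: t for t, i in original_vocab.items()}
        self.id_to_alien = {i: t for t, i in alien_vocab.items()}

    def encode(self, text):
        """Original text -> alien text."""
        ids = [self.orig_to_id[tok] for tok in text.split()]
        return " ".join(self.id_to_alien[i] for i in ids)

    def decode(self, text):
        """Alien text -> original text."""
        ids = [self.alien_to_id[tok] for tok in text.split()]
        return " ".join(self.id_to_orig[i] for i in ids)


# Toy vocabularies sharing the ids 0..2 (illustrative values only)
original_vocab = {"hello": 0, "world": 1, "!": 2}
alien_vocab = {"zorp": 0, "quix": 1, "~": 2}

toy = ToyTranslator(original_vocab, alien_vocab)
alien_text = toy.encode("hello world !")
round_trip = toy.decode(alien_text)
print(alien_text)
print(round_trip)
```

Because the two vocabularies share ids one-to-one, decoding inverts encoding exactly in this toy setting. Whether real subword tokenization always round-trips this cleanly depends on the tokenizers involved.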

## Key Files Description

- `token_matching.py`: Finds the optimal token matching using proxy model embeddings
- `build_tokenizer.ipynb`: Creates the alien tokenizer from the matching results
- `build_translator.py`: Converts text between its original and alien forms
- `train_full.sh`: Runs model training with Axolotl