# Rust Tokenizer Worker

High-performance tokenizer worker for the GitHub PR synthesis pipeline. It sustains roughly 22M tokens/s, about 12x faster than the Go implementation.

## Architecture

**Channel-Based Async Architecture**: Bounded channels (analogous to buffered Go channels) decouple reading, processing, and writing:

```
Go Producer Pool (184 workers)
    ↓ (send requests via stdin)
    ↓
┌─────────────────────────────────────────────────┐
│   Rust Tokenizer Worker (Single Process)       │
│                                                 │
│   Reader Thread                                 │
│       ↓ (reads requests from stdin)             │
│       ↓                                         │
│   Job Channel (bounded buffer: 1000)            │
│       ↓ (individual PRs as jobs)                │
│       ↓                                         │
│   Worker Pool (128 threads via Rayon)           │
│   ├─ Tokenizer Group 1 (threads 1-16)          │
│   ├─ Tokenizer Group 2 (threads 17-32)         │
│   ├─ ...                                        │
│   └─ Tokenizer Group 8 (threads 113-128)       │
│       ↓ (processes jobs in ANY ORDER)           │
│       ↓                                         │
│   Result Channel (bounded buffer: 200)          │
│       ↓ (tokenized results)                     │
│       ↓                                         │
│   Writer Thread                                 │
│       ↓ (batches results, writes to stdout)     │
└─────────────────────────────────────────────────┘
    ↓ (batched responses)
    ↓
Go Harvester (1 goroutine)
    ↓ (receives responses, writes to parquet)
Parquet Writers
```

**Key Design:**
- **3-stage pipeline**: Reader → Workers → Writer (fully decoupled)
- **Bounded channels**: Provide backpressure like Go channel buffers
- **No head-of-line blocking**: Long-tail samples don't block other PRs
- **Out-of-order processing**: Workers process PRs as fast as possible
- **Free response batching**: Writer groups results into batches independently of request boundaries (default: 100 per batch)
- **128 workers**: Optimal for 192-core systems
- **16 threads per tokenizer**: Best performance from benchmarks (8 tokenizer groups)
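
The three stages above can be sketched with the standard library alone. This is a minimal illustration, not the worker's actual code: `run_pipeline` and `collect_batches` are hypothetical names, `job * 2` stands in for tokenization, plain threads sharing a mutex-guarded receiver stand in for the Rayon pool, and `std::sync::mpsc::sync_channel` plays the role of the bounded channels.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::sync::{Arc, Mutex};
use std::thread;

/// Toy 3-stage pipeline: reader -> workers -> writer.
/// Returns the batches the writer stage produced.
fn run_pipeline(
    jobs: Vec<u64>,
    workers: usize,
    job_buffer: usize,
    result_buffer: usize,
    batch_size: usize,
) -> Vec<Vec<u64>> {
    let (job_tx, job_rx) = sync_channel::<u64>(job_buffer);
    let (res_tx, res_rx) = sync_channel::<u64>(result_buffer);
    // std receivers are single-consumer, so the workers share one behind a mutex.
    let job_rx = Arc::new(Mutex::new(job_rx));

    // Stage 1: reader. `send` blocks while the job channel is full (backpressure).
    let reader = thread::spawn(move || {
        for job in jobs {
            if job_tx.send(job).is_err() {
                break;
            }
        }
        // `job_tx` is dropped here, which closes the job channel.
    });

    // Stage 2: workers take whichever job is next, so results finish in any order.
    let mut handles = Vec::new();
    for _ in 0..workers {
        let job_rx = Arc::clone(&job_rx);
        let res_tx = res_tx.clone();
        handles.push(thread::spawn(move || loop {
            // The lock is released at the end of this statement, before processing.
            let job = match job_rx.lock().unwrap().recv() {
                Ok(job) => job,
                Err(_) => break, // job channel closed and drained
            };
            let result = job * 2; // placeholder for tokenization
            if res_tx.send(result).is_err() {
                break;
            }
        }));
    }
    drop(res_tx); // only the workers hold result senders now

    // Stage 3: writer groups results into batches regardless of request boundaries.
    let writer = thread::spawn(move || collect_batches(res_rx, batch_size));

    reader.join().unwrap();
    for handle in handles {
        handle.join().unwrap();
    }
    writer.join().unwrap()
}

/// Drains the result channel into batches of at most `batch_size` items.
fn collect_batches(rx: Receiver<u64>, batch_size: usize) -> Vec<Vec<u64>> {
    let mut batches = Vec::new();
    let mut current = Vec::with_capacity(batch_size);
    for item in rx {
        current.push(item);
        if current.len() == batch_size {
            batches.push(std::mem::replace(&mut current, Vec::with_capacity(batch_size)));
        }
    }
    if !current.is_empty() {
        batches.push(current); // flush the final partial batch
    }
    batches
}
```

With the documented defaults this sketch would be called as `run_pipeline(jobs, 128, 1000, 200, 100)`. The order of results inside and across batches is not deterministic.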

## Building

```bash
cd tokenizer/rust_worker
cargo build --release
```

The binary will be at `target/release/tokenizer_worker`.

## Command-Line Options

```bash
./tokenizer_worker --help
```

**Options:**
- `-m, --model <MODEL>`: Tokenizer model name or path to `tokenizer.json`.
  - Default: `Qwen/Qwen2.5-Coder-32B-Instruct`

- `-w, --workers <WORKERS>`: Number of worker threads (`0` = number of CPUs).
  - Default: `128`

- `-t, --threads-per-tokenizer <THREADS_PER_TOKENIZER>`: Tokenizer sharing strategy.
  - `0` = all threads share one tokenizer
  - `1` = one tokenizer per thread
  - `N` = N threads share one tokenizer
  - Default: `16` (optimal from benchmarks)

- `--request-buffer <SIZE>`: Job channel buffer size (like a Go channel buffer).
  - Default: `1000`

- `--response-batch-size <SIZE>`: Number of results to batch before writing.
  - Default: `100`
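
The `--threads-per-tokenizer` values can be read as a mapping from worker threads to tokenizer instances. The sketch below shows one plausible mapping. `tokenizer_count` and `tokenizer_index` are hypothetical helpers, not functions from the worker, and contiguous grouping with 0-based indices is an assumption based on the architecture diagram.

```rust
/// Number of tokenizer instances implied by the sharing strategy.
/// `0` means one tokenizer shared by every thread.
fn tokenizer_count(workers: usize, threads_per_tokenizer: usize) -> usize {
    match threads_per_tokenizer {
        0 => 1,
        n => (workers + n - 1) / n, // ceiling division
    }
}

/// 0-based index of the tokenizer a given worker thread (also 0-based) would use,
/// assuming contiguous groups of `threads_per_tokenizer` threads.
fn tokenizer_index(worker: usize, threads_per_tokenizer: usize) -> usize {
    match threads_per_tokenizer {
        0 => 0,
        n => worker / n,
    }
}
```

Under these assumptions the defaults (`--workers 128 --threads-per-tokenizer 16`) give 8 instances, and worker 127 maps to instance 7.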

**Examples:**

```bash
# Use default settings (128 workers, 16 threads per tokenizer)
./tokenizer_worker

# Use custom model
./tokenizer_worker --model /path/to/tokenizer.json

# Adjust channel buffer sizes
./tokenizer_worker --request-buffer 2000 --response-batch-size 200

# Limit to 64 worker threads
./tokenizer_worker --workers 64
```

## IPC Protocol

**Length-Prefixed MessagePack Binary Protocol**

Length prefixes remove the need for line-based scanning, which avoids Go's `bufio.Scanner: token too long` error on large messages, and MessagePack encodes and decodes faster than JSON.

### Message Format
```
[4 bytes: message length (u32 big-endian)]
[MessagePack encoded message body]
```
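
A minimal sketch of this framing, using only the standard library. `write_frame` and `read_frame` are hypothetical names, and the body is treated as opaque bytes; in the worker it would be the MessagePack-encoded message.

```rust
use std::io::{self, Read, Write};

/// Writes one frame: a 4-byte big-endian length followed by the body.
fn write_frame<W: Write>(w: &mut W, body: &[u8]) -> io::Result<()> {
    if body.len() > u32::MAX as usize {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "message body exceeds u32::MAX bytes",
        ));
    }
    w.write_all(&(body.len() as u32).to_be_bytes())?;
    w.write_all(body)
}

/// Reads one frame. Returns `Ok(None)` if the stream ends while reading the
/// length prefix (this sketch does not distinguish a clean end of stream from
/// a truncated prefix). A body shorter than announced is an error.
fn read_frame<R: Read>(r: &mut R) -> io::Result<Option<Vec<u8>>> {
    let mut len_buf = [0u8; 4];
    match r.read_exact(&mut len_buf) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
        Err(e) => return Err(e),
    }
    let len = u32::from_be_bytes(len_buf) as usize;
    let mut body = vec![0u8; len];
    r.read_exact(&mut body)?;
    Ok(Some(body))
}
```

Because the reader always knows how many bytes to expect, message size is limited only by the `u32` prefix (about 4 GiB), not by a line buffer. A production reader would likely also cap `len` before allocating.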

### Request Structure
```rust
struct TokenizeRequest {
    command: String,        // "tokenize"
    prs: Vec<PRText>,       // Can be single PR or multiple PRs
    max_tokens: i32,
}

struct PRText {
    repo_id: i64,
    repo_name: String,
    pr_id: i64,
    text: String,
}
```

### Response Structure
```rust
struct TokenizeResponse {
    status: String,         // "success" or "error"
    results: Vec<TokenizedResult>,  // Can contain multiple results
    error: Option<String>,
}

struct TokenizedResult {
    repo_id: i64,
    repo_name: String,
    pr_id: i64,
    token_ids: Option<Vec<i32>>,
    token_count: i32,
    byte_size: i32,
    discarded: bool,
}
```

**Key Design:**
- No batch IDs: PRs are identified by `(repo_id, pr_id)`, which is already unique
- Flexible batching: Requests can contain any number of PRs
- Free response batching: The Rust worker can group results into responses independently of how requests were grouped
- Out-of-order processing: Responses don't need to match request order
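
Since responses arrive in any order and carry no batch ID, a consumer that needs to correlate results with what it sent can key on `(repo_id, pr_id)`. The sketch below is illustrative only: `Pending` is a hypothetical type, and in this pipeline any such bookkeeping would live on the Go side; Rust is used here to match the structures above.

```rust
use std::collections::HashMap;

/// Key that identifies a PR across requests and responses: (repo_id, pr_id).
type PrKey = (i64, i64);

/// Tracks PRs that have been sent but not yet answered, with per-PR context `T`.
struct Pending<T> {
    in_flight: HashMap<PrKey, T>,
}

impl<T> Pending<T> {
    fn new() -> Self {
        Pending { in_flight: HashMap::new() }
    }

    /// Records that a PR was sent for tokenization.
    fn sent(&mut self, repo_id: i64, pr_id: i64, context: T) {
        self.in_flight.insert((repo_id, pr_id), context);
    }

    /// Resolves a result by key, in whatever order results arrive.
    /// Returns `None` for a key that is unknown or already resolved.
    fn completed(&mut self, repo_id: i64, pr_id: i64) -> Option<T> {
        self.in_flight.remove(&(repo_id, pr_id))
    }

    /// Number of PRs still awaiting a result.
    fn outstanding(&self) -> usize {
        self.in_flight.len()
    }
}
```

If the consumer only appends results to an output file, as the harvester does with parquet, the fields echoed in `TokenizedResult` may be all it needs and no lookup table is required.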

**Advantages over JSON:**
- No line length limits (avoids the `bufio.Scanner: token too long` error)
- ~2-3x faster serialization/deserialization
- Smaller message size (~30% reduction)
- Binary format (integers and lengths are not converted to text, and strings need no escaping)

## Performance

Benchmark on the production server (192 cores, 184 available in the container):
- **Throughput**: 22M tokens/s (1,810 PRs/s)
- **Data rate**: 88 MB/s
- **Worker configuration**: 128 workers, 16 threads per tokenizer (8 tokenizer groups)
- **Channel buffers**: 1000 jobs, 200 results
- **Response batching**: 100 results per batch
- **IPC overhead**: <1% (length-prefixed MessagePack)
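
A few per-unit figures follow from the numbers above by simple division. The helpers below are hypothetical and only restate that arithmetic; they assume 1 MB = 10^6 bytes. The source does not say what the 88 MB/s measures, so bytes per token is just the ratio of the two reported rates.

```rust
/// Average tokens per PR implied by the two throughput figures.
fn tokens_per_pr(tokens_per_sec: f64, prs_per_sec: f64) -> f64 {
    tokens_per_sec / prs_per_sec
}

/// Bytes per token implied by the data rate and token throughput.
fn bytes_per_token(bytes_per_sec: f64, tokens_per_sec: f64) -> f64 {
    bytes_per_sec / tokens_per_sec
}

/// Average token throughput per worker thread.
fn tokens_per_worker(tokens_per_sec: f64, workers: f64) -> f64 {
    tokens_per_sec / workers
}
```

For the reported benchmark, `tokens_per_pr(22e6, 1810.0)` is roughly 12,155, `bytes_per_token(88e6, 22e6)` is 4, and `tokens_per_worker(22e6, 128.0)` is about 172K tokens/s per thread.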

**Why 128 workers with 16 threads per tokenizer?**
- Empirically determined optimal configuration
- Balances tokenizer memory usage vs parallelism
- 8 tokenizer instances total (128 ÷ 16)
- Sharing each tokenizer across 16 threads cuts tokenizer memory 16x versus one tokenizer per thread (8 instances instead of 128)

**Channel Buffer Sizing:**
- **Job channel (1000)**: Absorbs bursts from 184 Go producers
- **Result channel (200)**: Buffers results between workers and writer
- **Backpressure**: Bounded channels prevent memory overflow
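
The backpressure behaviour can be observed directly with a bounded channel from the standard library. This is a small sketch rather than the worker's code: `accepted_before_full` is a hypothetical helper, and `try_send` is used so the example reports the refusal instead of blocking the way a plain `send` would.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Pushes into a bounded channel that nobody is draining and returns how many
/// sends were accepted before the channel refused more.
fn accepted_before_full(capacity: usize, attempts: usize) -> usize {
    // `_rx` keeps the receiver alive so the channel stays connected.
    let (tx, _rx) = sync_channel::<usize>(capacity);
    let mut accepted = 0;
    for i in 0..attempts {
        match tx.try_send(i) {
            Ok(()) => accepted += 1,
            // Buffer is full: a blocking `send` would wait here instead.
            Err(TrySendError::Full(_)) => break,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    accepted
}
```

With a capacity of 1000, at most 1000 jobs are ever queued no matter how many producers are sending, which is what bounds memory use when the workers fall behind.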

## Configuration

### Go Pipeline Configuration

The Go pipeline uses an async message queue pattern:

```go
rustConfig := tokenizer.RustWorkerConfig{
    WorkerPath:          cfg.RustTokenizerPath,
    Model:               cfg.TokenizerModel,
    Workers:             128,  // Internal worker threads
    ThreadsPerTokenizer: 16,   // Optimal from benchmarks
}
```

**Architecture Flow:**
1. **184 Go producers** send tokenization requests to Rust worker (non-blocking)
2. **Single Rust process** processes requests with 128 internal workers (out-of-order)
3. **1 Go harvester** receives responses and writes to parquet

**Benefits:**
- Eliminates process spawning overhead (184 processes → 1 process)
- No head-of-line blocking from long-tail samples
- Maintains high concurrency through internal parallelism
- Simplifies resource management
- Reduces memory footprint (8 tokenizers vs 184)

### Environment Variables

```bash
# Path to Rust worker binary
export RUST_TOKENIZER_PATH=./tokenizer/rust_worker/target/release/tokenizer_worker

# Tokenizer model (passed to Rust worker via --model flag)
export TOKENIZER_MODEL=Qwen/Qwen2.5-Coder-32B-Instruct

# Number of Go producer workers (all share the single Rust worker process)
export OFFLINE_CONCURRENCY=184
```

## Tokenizer Model

Uses the `Qwen/Qwen2.5-Coder-32B-Instruct` tokenizer from the Hugging Face cache.

Download it first using Python:
```python
from transformers import AutoTokenizer
AutoTokenizer.from_pretrained('Qwen/Qwen2.5-Coder-32B-Instruct')
```

The worker will automatically find it in:
- `$HF_HOME/hub/models--Qwen--Qwen2.5-Coder-32B-Instruct/snapshots/*/tokenizer.json`
- `~/.cache/huggingface/hub/models--Qwen--Qwen2.5-Coder-32B-Instruct/snapshots/*/tokenizer.json`
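
The lookup above could be implemented roughly as follows. This is a sketch under assumptions, not the worker's actual resolution logic: `cache_dir_name`, `hub_roots`, and `find_tokenizer_json` are hypothetical names, `~` is resolved through the `HOME` environment variable, and the first snapshot directory containing `tokenizer.json` wins.

```rust
use std::env;
use std::fs;
use std::path::{Path, PathBuf};

/// Converts "org/name" into the hub cache directory name "models--org--name".
fn cache_dir_name(model: &str) -> String {
    format!("models--{}", model.replace('/', "--"))
}

/// Candidate hub roots in the order listed above:
/// `$HF_HOME/hub` (if set), then `~/.cache/huggingface/hub`.
fn hub_roots() -> Vec<PathBuf> {
    let mut roots = Vec::new();
    if let Some(hf_home) = env::var_os("HF_HOME") {
        roots.push(PathBuf::from(hf_home).join("hub"));
    }
    if let Some(home) = env::var_os("HOME") {
        roots.push(PathBuf::from(home).join(".cache/huggingface/hub"));
    }
    roots
}

/// Returns the first `snapshots/*/tokenizer.json` found under `hub_root`.
fn find_tokenizer_json(hub_root: &Path, model: &str) -> Option<PathBuf> {
    let snapshots = hub_root.join(cache_dir_name(model)).join("snapshots");
    let entries = fs::read_dir(snapshots).ok()?;
    for entry in entries.flatten() {
        let candidate = entry.path().join("tokenizer.json");
        if candidate.is_file() {
            return Some(candidate);
        }
    }
    None
}
```

A caller would try each root in turn, for example `hub_roots().iter().find_map(|root| find_tokenizer_json(root, model))`, and report an error if nothing is found.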