# AgentVocab: Structure-Aware Vocabulary Adaptation for Efficient LLM Agents

[![License](https://img.shields.io/badge/License-Apache_2.0-green)](./LICENSE)
[![Framework](https://img.shields.io/badge/Framework-ms--swift-red)](https://github.com/modelscope/ms-swift)
[![Python](https://img.shields.io/badge/Python-3.10-blue)]()

![AgentVocab Framework](assets/main.png)

## 📖 Abstract

Recent Large Language Models (LLMs) have demonstrated strong capabilities in agentic systems. However, a fundamental **training–deployment mismatch** persists: LLMs are trained with general-purpose tokenizers, while agentic usage is dominated by highly structured, repetitive tool-calling patterns (e.g., JSON schemas, function signatures).

This mismatch leads to inefficient tokenization, where structured data is fragmented into long sequences of low-level tokens. To address this, we introduce **AgentVocab**, a structure-aware vocabulary adaptation framework. AgentVocab derives specialized vocabulary entries from real tool-calling traces and adapts the model vocabulary to better reflect structural and semantic regularities.

**Key Results on τ-bench and τ²-bench:**
- 🚀 **Efficiency:** Reduces decoding latency by **15–26%** and input token counts by **~30%** relative to vanilla SFT.
- ✅ **Performance:** Maintains or improves tool-calling accuracy compared to vanilla SFT.
- 📉 **Shorter Trajectories:** Reduces the number of interaction turns required to solve tasks.

---

## 💡 Motivation

Standard tokenizers (like BPE) fragment structural keywords into incoherent sub-words. **AgentVocab** fuses these fragments into coherent semantic spans.

![Tokenizer Comparison](assets/tokenizer_comparison.png)
*Figure: Comparison of tokenization granularity. AgentVocab fuses fragmented syntax into coherent structural spans.*
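
As a rough illustration of this fragmentation, the sketch below tokenizes one tool call with a toy greedy longest-match tokenizer, first with a small base vocabulary and then with two fused structural spans added. The tokenizer, both vocabularies, and the example call are assumptions made for illustration; they are not the Qwen2.5 tokenizer or the entries AgentVocab actually mines.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization; unmatched characters become single-char tokens."""
    max_len = max(len(tok) for tok in vocab)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first and fall back to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical tool call and vocabularies (illustration only).
call = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
base_vocab = {"name", "get", "weather", "arg", "uments", "city", "Paris"}
adapted_vocab = base_vocab | {'{"name": "', '", "arguments": {"'}

base_tokens = greedy_tokenize(call, base_vocab)
adapted_tokens = greedy_tokenize(call, adapted_vocab)

print(len(base_tokens), base_tokens)        # 30 tokens with the toy base vocabulary
print(len(adapted_tokens), adapted_tokens)  # 14 tokens once the two spans are fused
```

Only two added entries remove more than half of the tokens in this toy case, because the JSON scaffolding repeats in every call while the argument values vary.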

---

## 📊 Experimental Results

We evaluated AgentVocab on **τ-bench** and **τ²-bench** using Qwen2.5-7B-Instruct.

### 1. Main Performance
AgentVocab consistently outperforms Vanilla SFT in efficiency while preserving accuracy.

| Benchmark | Strategy | Accuracy | Input Tokens | Latency (s) | Efficiency Gain (vs. Base) |
| :--- | :--- | :---: | :---: | :---: | :---: |
| **τ-bench** | Base | 13.94% | 7,156.4 | 0.242 | - |
| **τ-bench** | Vanilla SFT | 19.40% | 7,553.6 | 0.235 | - |
| **τ-bench** | **AgentVocab** | **20.61%** | **4,992.6** | **0.173** | ⚡ **Lat -28.5%** |
| | | | | | |
| **τ²-bench** | Base | 16.36% | 9,252.4 | 0.404 | - |
| **τ²-bench** | Vanilla SFT | 21.56% | 9,225.5 | 0.359 | - |
| **τ²-bench** | **AgentVocab** | **21.93%** | **6,795.2** | **0.302** | ⚡ **Lat -25.2%** |
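
The relative reductions can be recomputed from the table. The short script below uses only the numbers reported above and prints the reduction of AgentVocab against both the Base model and Vanilla SFT; the dictionary layout and function name are ours. The figures in the last column correspond to the latency reduction relative to Base, while the 15–26% range quoted in the abstract corresponds to the reduction relative to Vanilla SFT.

```python
# (input tokens, latency in seconds) copied from the results table.
results = {
    "tau-bench": {
        "base": (7156.4, 0.242),
        "sft": (7553.6, 0.235),
        "agentvocab": (4992.6, 0.173),
    },
    "tau2-bench": {
        "base": (9252.4, 0.404),
        "sft": (9225.5, 0.359),
        "agentvocab": (6795.2, 0.302),
    },
}


def reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


for bench, rows in results.items():
    tok_av, lat_av = rows["agentvocab"]
    for ref in ("base", "sft"):
        tok_ref, lat_ref = rows[ref]
        print(
            f"{bench} vs {ref}: "
            f"tokens -{reduction(tok_ref, tok_av):.1f}%, "
            f"latency -{reduction(lat_ref, lat_av):.1f}%"
        )
```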



### 2. Training Dynamics & Stability
Unlike Vanilla SFT, which collapses on long-horizon tasks (gray line), AgentVocab (red line) improves steadily in both accuracy and latency.

![Training Dynamics](assets/tau2.png)
*Figure: Step-wise evaluation on τ²-bench. AgentVocab exhibits steadily improving accuracy alongside decreasing latency.*

### 3. Reduced Interaction Turns
AgentVocab not only shortens the token length per message but also helps the agent solve problems in fewer turns.

![Average Turns](assets/turns.png)

---

## 🔍 Case Study

In constraint-sensitive scenarios (e.g., Telecom data limits), AgentVocab helps the model better attend to critical structural boundaries.

![Case Study](assets/case.png)
*Figure: A case from the Telecom domain. The base model fails to identify the data limit constraint, looping for 18 turns. AgentVocab resolves it in 12 turns.*

---

## 🛠️ Installation & Setup

### 1. Environment
```bash
conda create -n agentvocab python=3.10
conda activate agentvocab
pip install -r requirements.txt
```

### 2. Download Assets

We use the **Qwen2.5-7B-Instruct** model and the **Toucan-1.5M** dataset.

**Option A: ModelScope CLI (Recommended)**

```bash
modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./pretrained_models/base_model
modelscope download --dataset Agent-Ark/Toucan-1.5M --local_dir ./data/raw/dataset
```

**Option B: Hugging Face CLI**

```bash
pip install huggingface_hub
huggingface-cli download Qwen/Qwen2.5-7B-Instruct --local-dir ./pretrained_models/base_model
huggingface-cli download Agent-Ark/Toucan-1.5M --repo-type dataset --local-dir ./data/raw/dataset
```

---

## 🚀 Pipeline: Quick Start

We provide an end-to-end pipeline. Run these steps sequentially.

### Step 1: Data Processing

Convert raw parquet to JSONL, verify tokenization details, and filter the dataset.

```bash
# 1. Convert
python src/data_processing/convert_dataset.py \
    --input_dir "path/to/raw_dataset" \
    --output_file "path/to/processed_data.jsonl"

# 2. Verify & Generate Details (Crucial for Content Mining)
python src/data_processing/verify_tokenization.py \
    --input_file "path/to/processed_data.jsonl" \
    --output_file "path/to/verification_details.jsonl" \
    --model_path "path/to/base_model"

# 3. Filter (Recommended)
python src/data_processing/filter_dataset.py \
    --input_file "path/to/processed_data.jsonl" \
    --output_file "path/to/filtered_data.jsonl"
```

### Step 2: Token Mining (Dual-Branch)

**Branch A: Structural Tokens**

```bash
# Mine (Using filtered data)
python src/token_mining/mine_structural_aware_tokens.py \
    --input_file "path/to/filtered_data.jsonl" \
    --output_file "path/to/mined_structural_tokens.json" \
    --model_path "path/to/base_model" \
    --min_frequency 5

# Select Top-N
python src/token_mining/select_best_tokens.py \
    --input_file "path/to/mined_structural_tokens.json" \
    --output_file "path/to/selected_structural_tokens.json" \
    --top_n 500
```
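
To make the roles of `--min_frequency` and `--top_n` concrete, here is a minimal sketch of frequency-based span mining. It assumes candidates are contiguous n-grams of base tokens and that selection ranks them by tokens saved; the actual criteria in `mine_structural_aware_tokens.py` and `select_best_tokens.py` may differ, and the function names and toy traces are hypothetical.

```python
from collections import Counter


def mine_ngrams(token_sequences, min_n=2, max_n=4, min_frequency=5):
    """Count contiguous n-grams and keep those seen at least `min_frequency` times."""
    counts = Counter()
    for seq in token_sequences:
        for n in range(min_n, max_n + 1):
            for i in range(len(seq) - n + 1):
                counts[tuple(seq[i:i + n])] += 1
    return {ngram: c for ngram, c in counts.items() if c >= min_frequency}


def select_top_n(candidates, top_n):
    """Rank by tokens saved: each occurrence replaces len(ngram) tokens with one."""
    ranked = sorted(
        candidates.items(),
        key=lambda kv: (-(len(kv[0]) - 1) * kv[1], kv[0]),
    )
    return ["".join(ngram) for ngram, _ in ranked[:top_n]]


# Toy traces: a repeated JSON prefix plus one rare sequence.
traces = [['{"', "name", '":']] * 6 + [["foo", "bar"]]
mined = mine_ngrams(traces, min_frequency=5)
print(select_top_n(mined, top_n=1))  # ['{"name":']
```

The rare `foo bar` pair falls below the frequency threshold, and the three-token span outranks its two-token sub-spans because it saves more tokens per occurrence.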

**Branch B: Content Tokens**

```bash
# Mine (Using verification details)
python src/token_mining/mine_content_aware_tokens.py \
    --input_file "path/to/verification_details.jsonl" \
    --output_file "path/to/mined_content_tokens.json" \
    --model_path "path/to/base_model" \
    --num_rounds 3

# Select Top-N
python src/token_mining/select_best_tokens.py \
    --input_file "path/to/mined_content_tokens.json" \
    --output_file "path/to/selected_content_tokens.json" \
    --top_n 2000
```
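
One plausible reading of `--num_rounds` is a BPE-style loop in which each round fuses the most frequent adjacent pair and the next round can build on the result. The sketch below illustrates that idea only; it is an assumption, not a description of `mine_content_aware_tokens.py`, and the function names and sample data are hypothetical.

```python
from collections import Counter


def merge_pair(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with the fused token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def mine_by_rounds(sequences, num_rounds=3):
    """Fuse the most frequent adjacent pair once per round."""
    new_tokens = []
    for _ in range(num_rounds):
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        # Ties are broken deterministically by comparing the pairs themselves.
        best_pair, _count = max(pair_counts.items(), key=lambda kv: (kv[1], kv[0]))
        new_tokens.append(best_pair[0] + best_pair[1])
        sequences = [merge_pair(seq, best_pair) for seq in sequences]
    return new_tokens, sequences


values = [["San", " Fran", "cisco"], ["San", " Fran", "cisco"], ["San", " Jose"]]
tokens, fused = mine_by_rounds(values, num_rounds=2)
print(tokens)  # ['San Fran', 'San Francisco']
print(fused)
```

Later rounds can extend earlier merges, which is how multi-token values end up as a single vocabulary entry.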

**Merge Vocabularies**

```bash
python src/token_mining/merge_tokens.py \
    --input_files \
        "path/to/selected_structural_tokens.json" \
        "path/to/selected_content_tokens.json" \
    --output_file "path/to/final_merged_tokens.json"
```
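
Conceptually, merging amounts to concatenating the two selections and dropping duplicates. The sketch below assumes each selection is a flat list of token strings; the real JSON schema and any conflict handling in `merge_tokens.py` are not shown here and may differ.

```python
def merge_token_lists(*token_lists):
    """Concatenate token lists, dropping duplicates while keeping first-seen order."""
    return list(dict.fromkeys(tok for tokens in token_lists for tok in tokens))


# Hypothetical selections from the two branches.
structural = ['{"name": "', '"}']
content = ['"}', "San Francisco"]
print(merge_token_lists(structural, content))
```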

### Step 3: Model Surgery & Initialization

Resize the model and initialize embeddings using **Mean Pooling**.

```bash
python src/model_surgery/expand_vocab.py \
    --base_model_path "path/to/base_model" \
    --new_tokens_file "path/to/final_merged_tokens.json" \
    --output_model_path "path/to/initialized_model" \
    --device cpu
```
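
A common form of mean-pooling initialization sets each new token's embedding to the average of the embeddings of the sub-tokens that the base tokenizer produces for the same text. The pure-Python sketch below shows that arithmetic on a toy table; it assumes this is what `expand_vocab.py` does, which the README does not spell out. A real implementation would operate on the model's embedding tensors, for example after calling `resize_token_embeddings` in Hugging Face Transformers.

```python
def mean_pool_init(embeddings, sub_token_ids):
    """Element-wise mean of the embedding rows for `sub_token_ids`."""
    dim = len(embeddings[0])
    rows = [embeddings[i] for i in sub_token_ids]
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]


# Toy embedding table: 3 existing tokens, dimension 2.
embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

# Hypothetical new token whose text the base tokenizer splits into ids [0, 1, 2].
new_row = mean_pool_init(embeddings, [0, 1, 2])
embeddings.append(new_row)  # "resize" the table by one row

print(new_row)  # [1.0, 1.0]
```

Starting from the sub-token mean keeps the new embedding inside the region the model already understands, so fine-tuning refines it instead of learning it from random noise.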

### Step 4: Training (SFT)

Training uses `ms-swift` with DeepSpeed ZeRO-3. The launch script automatically maintains a **global batch size of 64**.

```bash
# Syntax: bash scripts/run_sft.sh <MODEL_PATH> <EXP_NAME> [N_GPUS]

bash scripts/run_sft.sh \
    "path/to/initialized_model" \
    "experiment_name" \
    8
```
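
A fixed global batch size is usually maintained by scaling gradient accumulation with the number of GPUs. The sketch below shows that arithmetic; the per-device batch size of 1 is an assumption for illustration, and how `run_sft.sh` actually derives its settings is not documented here.

```python
def grad_accum_steps(global_batch_size, n_gpus, per_device_batch_size):
    """Accumulation steps so that n_gpus * per_device_batch_size * steps == global_batch_size."""
    per_step = n_gpus * per_device_batch_size
    if global_batch_size % per_step != 0:
        raise ValueError(
            f"global batch size {global_batch_size} is not divisible by {per_step}"
        )
    return global_batch_size // per_step


# Hypothetical per-device batch size of 1.
for n_gpus in (1, 2, 4, 8):
    steps = grad_accum_steps(64, n_gpus, per_device_batch_size=1)
    print(f"{n_gpus} GPU(s): gradient_accumulation_steps={steps}")
```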

### Step 5: Export & Inference

Remove optimizer states for lightweight deployment.

```bash
# Export
# Syntax: bash scripts/export_model.sh <CHECKPOINT_PATH> <OUTPUT_PATH>
bash scripts/export_model.sh \
    "path/to/training_checkpoint" \
    "path/to/final_model"

# Chat
python scripts/inference.py --model_path "path/to/final_model"
```

---

## 📂 Project Structure

```text
.
├── scripts/                # Automation scripts (SFT, Export, Inference)
└── src/                    # Source code
    ├── data_processing/    # ETL pipelines
    ├── token_mining/       # Structural & Content mining logic
    └── model_surgery/      # Embedding initialization & resizing
```