# Inference Scaling of LLM Ensembling: Bridging Token Spaces with Token Translation

This repository contains reference implementation for the method proposed in the paper:  
**"Inference Scaling of LLM Ensembling: Bridging Token Spaces with Token Translation"**.

We introduce a lightweight and efficient approach to ensemble large language models (LLMs) with incompatible tokenizers. By leveraging tokenizer priors, we construct a *token translation matrix* that enables robust, scalable token-level ensembling across diverse models.

---

## Requirements

Install the required Python packages via:

```bash
pip install torch>=2.0 transformers==4.39.3 accelerate einops tqdm numpy
```

---

## Step 1: Generate Token Translation Matrix

To align token spaces between **LLaMA 3.1 8B Instruct** and **InternLM2.5 7B Chat**, run:

```bash
python src/transfer_matrix/token_reverse_transfer_matrix.py \
    --model_names meta-llama/Llama-3.1-8B-Instruct internlm/internlm2_5-7b-chat \
    --save_path mapping_matrices/llama3.1_internlm2.5.pt
```

This creates a token translation matrix based on tokenizer behavior and saves it under `mapping_matrices/`.

---

## Step 2: Run Ensemble Inference

Prepare a configuration file at `conf/ensemble_llama3.1_internlm2.5.yaml`.

Then launch the ensemble decoding process:

```bash
python src/main_many_ensemble_llama_series_local_matrix.py \
    --conf conf/ensemble_llama3.1_internlm2.5.yaml \
```