

# UniPruneBench: A Visual Input Token Compression Benchmark for Large Multimodal Models
**ICLR 2026 submission** – supplementary code base


## 1. File map 

| File | Purpose |
|------|---------|
| `custom_qwenvl.py` | Monkey-patch for **Qwen2.5-VL**  |
| `custom_internvl.py` | Monkey-patch for **InternVL-2.5** |
| `run_qwenvl2_5.py` | Entry point – swaps forward hooks and launches VLMEvalKit |
| `run_internvl3.py` | Entry point for InternVL models – same role as `run_qwenvl2_5.py` |
| `utils.py` | `compute_attention_weight()` helper (eager mode) |
| `prune_registry.py` | Central dispatcher – maps a method-name string to its prune function |
| `Random_PreLLM.py` | `random_pre_llm` – uniform drop before LLM |
| `GPrune_PreLLM.py` | `gprune_pre_llm` – graph-centrality drop |
| `DivPrune_PreLLM.py` | `divprune_pre_llm` – diversity maximisation |
| `Random_IntraLLM.py` | `random_intra_llm` – uniform drop inside LLM |
| `FastV_IntraLLM.py` | `fastv_intra_llm` – last-query attention score |
| `FitPrune_IntraLLM.py` | `fitprune_intra_llm` – self × cross-attention product |
| `DART_IntraLLM.py` | `dart_intra_llm` – key-state L1 + cosine reranking |
| `Pdrop_IntraLLM.py` | `pdrop_intra_llm` – per-layer attention pooling drop |
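
The string-to-function dispatch described for `prune_registry.py` can be pictured with a minimal sketch. This is not the repository's code: the `PRUNE_REGISTRY` dict, the `register` decorator, `get_prune_fn`, and the index-based signature of `random_pre_llm` below are all hypothetical, chosen only to illustrate the pattern with the standard library.

```python
import random
from typing import Callable, Dict, List

# Hypothetical registry: method name -> function returning kept token indices.
PRUNE_REGISTRY: Dict[str, Callable[..., List[int]]] = {}


def register(name: str):
    """Decorator that files a prune function under `name`."""
    def decorator(fn):
        PRUNE_REGISTRY[name] = fn
        return fn
    return decorator


@register("random_pre_llm")
def random_pre_llm(num_tokens: int, keep_ratio: float, seed: int = 0) -> List[int]:
    """Keep a uniformly random subset of token indices, in original order."""
    num_keep = max(1, round(num_tokens * keep_ratio))
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_tokens), num_keep))


def get_prune_fn(name: str) -> Callable[..., List[int]]:
    """Look up a prune function by its string name."""
    if name not in PRUNE_REGISTRY:
        raise KeyError(f"unknown prune method {name!r}; known: {sorted(PRUNE_REGISTRY)}")
    return PRUNE_REGISTRY[name]


# Example: 576 visual tokens (an illustrative count) at a keep ratio of 0.111.
kept = get_prune_fn("random_pre_llm")(num_tokens=576, keep_ratio=0.111)
print(len(kept))
```

Returning sorted indices keeps the surviving tokens in their original sequence order. The actual prune functions presumably operate on tensors rather than plain index lists.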

---

## 2. Environment installation
```bash
# Create the Conda environment with Python 3.12.3
conda create -n vlm-prune python=3.12.3 -y

# Activate the environment
eval "$(conda shell.bash hook)"
conda activate vlm-prune

# Install the packages using pip
pip install transformers==4.54.0
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install flash-attn==2.7.3
pip install timm==1.0.19

cd VLMEvalKit && pip install -e .
```

---

## 3. Run a single-dataset evaluation (example: MME, 88.9 % pruning)

### Qwen2.5-VL
```bash
export KEEP_RATIO=0.111   # keep 11.1% of tokens, i.e. prune 88.9%
export PRUNE_METHOD_PRE_LLM=divprune_pre_llm

python run_qwenvl2_5.py \
    --model Qwen2.5-VL-7B-Instruct \
    --data MME \
    --verbose
```

### InternVL-3
```bash
export KEEP_RATIO=0.111
export PRUNE_METHOD_INTRA_LLM=fitprune_intra_llm

python run_internvl3.py \
    --model InternVL3-8B \
    --data MME \
    --verbose
```
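
As a rough illustration of how the environment variables above relate to the pruning percentage in the heading, the sketch below parses them into a small config object. How the run scripts actually read these variables is not shown here, so `PruneConfig`, `read_prune_config`, and the default of `1.0` for an unset `KEEP_RATIO` are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class PruneConfig:
    keep_ratio: float
    pre_llm_method: Optional[str]
    intra_llm_method: Optional[str]

    @property
    def pruned_percent(self) -> float:
        """Share of tokens removed, in percent, rounded to one decimal."""
        return round((1.0 - self.keep_ratio) * 100, 1)


def read_prune_config(env: Mapping[str, str]) -> PruneConfig:
    # Assumption: an unset KEEP_RATIO means "no pruning" (ratio 1.0).
    keep_ratio = float(env.get("KEEP_RATIO", "1.0"))
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError(f"KEEP_RATIO must be in (0, 1], got {keep_ratio}")
    return PruneConfig(
        keep_ratio=keep_ratio,
        pre_llm_method=env.get("PRUNE_METHOD_PRE_LLM"),
        intra_llm_method=env.get("PRUNE_METHOD_INTRA_LLM"),
    )


cfg = read_prune_config(
    {"KEEP_RATIO": "0.111", "PRUNE_METHOD_PRE_LLM": "divprune_pre_llm"}
)
print(cfg.pruned_percent, cfg.pre_llm_method)
```

In a real run, `os.environ` would be passed in place of the literal dict. Taking a mapping as the argument keeps the function easy to test.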

---

## 4. Time profiling
To record timings, set the following environment variables before launching a run:
```bash
export METHOD_TIME=True     # GPU time of the pruning subroutine, in ms
export PREFILL_TIME=True    # GPU time of the whole pre-fill stage, in ms
```
Raw timings are appended to `method_times_<method>.txt` and `prefill_times.txt`.
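
The raw files can then be aggregated offline. The sketch below assumes each line holds a single numeric value in milliseconds. That layout is a guess, not a documented format, so adjust the parsing if the files look different. `summarize_timings` is a hypothetical helper, not part of this code base.

```python
from statistics import mean, median
from typing import Dict, Iterable


def summarize_timings(lines: Iterable[str]) -> Dict[str, float]:
    """Aggregate timings, assuming one millisecond value per non-empty line."""
    values = [float(s) for s in (line.strip() for line in lines) if s]
    if not values:
        raise ValueError("no timings found")
    return {
        "count": len(values),
        "mean_ms": mean(values),
        "median_ms": median(values),
    }


# Example with in-memory lines; blank lines are skipped.
stats = summarize_timings(["12.5\n", "13.5\n", "\n", "14.0"])
print(stats)
```

An open file handle works as the argument as well, for example `with open("prefill_times.txt") as f: summarize_timings(f)`.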

