# d3LLM: Ultra-Fast Diffusion LLM using Pseudo-Trajectory Distillation 🚀

This is the official implementation of the paper *d3LLM: Ultra-Fast Diffusion LLM using Pseudo-Trajectory Distillation*, in which we introduce a recipe for building an ultra-fast diffusion language model named ***d3LLM*** (_pseuDo-Distilled Diffusion LLM_) 🚀.


## 📖 What is d3LLM?

**d3LLM** is a framework for building ultra-fast diffusion language models with negligible accuracy degradation. It achieves up to **5× speedup** over autoregressive models on H100 GPUs while maintaining competitive performance.


## 🎯 Getting Started

### Installation

```bash
# Install dependencies.
# The pinned versions matter: transformers==4.49.0, lm_eval==0.4.9, datasets==3.2.0, flash_attn==2.7.4.post1
pip install -r requirements.txt
```
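
To confirm that the pinned versions are the ones actually installed, a small check along these lines may help. This is a minimal sketch, not part of the repository: `PINNED` and `find_mismatches` are hypothetical names, and it assumes the distribution names match the package names in the note above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned versions copied from the installation note above.
PINNED = {
    "transformers": "4.49.0",
    "lm_eval": "0.4.9",
    "datasets": "3.2.0",
    "flash_attn": "2.7.4.post1",
}


def find_mismatches(pinned, get_version=version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    # Print one line per package whose installed version differs from the pin.
    for name, (expected, installed) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {installed}")
```

An empty result means every pinned package is present at the expected version.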


## 🔬 How d3LLM Works

The d3LLM framework combines two key innovations:


### (i) Pseudo-Trajectory Distillation 📚

Instead of random masking, we extract the teacher model's decoding order, i.e., the sequence in which it unmasks tokens. This pseudo-trajectory guides the student model to learn efficient generation patterns.

- **Pseudo-Trajectory Extraction** → 18% TPF (tokens per forward) improvement
- **Progressive Noise Schedule** → Additional 12% TPF boost
- **Progressive Window Sizing** → Another 8% TPF gain

<div align="center">

![Distillation Process](asset/imgs/fig_distillation.png)
*Our pseudo-trajectory-based distillation*

</div>
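
The difference between random masking and masking along the teacher's decoding order can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the repository's implementation: `MASK_ID`, `unmask_step`, and all function names are hypothetical, and the teacher's trajectory is assumed to be recorded as the step at which each position was unmasked.

```python
import numpy as np

MASK_ID = -1  # hypothetical mask-token id, chosen only for this sketch


def decoding_order(unmask_step):
    """Positions sorted by the step at which the teacher unmasked them."""
    return np.argsort(unmask_step, kind="stable")


def trajectory_state(tokens, unmask_step, step):
    """Teacher's intermediate sequence after `step` decoding steps.

    Positions the teacher revealed at or before `step` keep their token;
    all later positions are still masked.
    """
    tokens = np.asarray(tokens)
    unmask_step = np.asarray(unmask_step)
    return np.where(unmask_step <= step, tokens, MASK_ID)


def random_state(tokens, mask_ratio, rng):
    """Baseline for comparison: mask a random subset of positions."""
    tokens = np.asarray(tokens)
    masked = rng.random(tokens.shape) < mask_ratio
    return np.where(masked, MASK_ID, tokens)


# Toy data: five tokens and the step at which the teacher revealed each one.
tokens = [11, 12, 13, 14, 15]
unmask_step = [2, 1, 1, 3, 2]

order = decoding_order(unmask_step).tolist()                   # [1, 2, 0, 4, 3]
after_step_1 = trajectory_state(tokens, unmask_step, 1).tolist()  # [-1, 12, 13, -1, -1]
```

A student trained on states like `after_step_1` sees the same partially unmasked sequences the teacher passed through, rather than arbitrary mask patterns.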


### (ii) Multi-Block Decoding Strategy ⚡

We decode multiple blocks in parallel, using entropy-based token selection to decide which tokens to unmask.

- **Entropy-Based Multi-Block Decoding** → 30% TPF improvement
- **KV-Cache with Periodic Refresh** → 35% TPS (tokens per second) boost in long contexts
- **Early Stopping on EOS** → 5% TPF gain

<div align="center">

![Multi-Block Decoding](asset/imgs/fig_decoding.png)
*Entropy-based multi-block decoding with KV-cache and refresh.*

</div>
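
One way to picture entropy-based selection across blocks is the sketch below. It is illustrative only and makes several assumptions that the text above does not confirm: a fixed entropy `threshold`, equally sized blocks, and a fallback that unmasks the single most confident position when nothing passes the threshold. The function names are hypothetical.

```python
import numpy as np


def token_entropy(probs):
    """Shannon entropy (in nats) of each position's predicted distribution."""
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log(probs + 1e-12), axis=-1)


def select_positions(probs, is_masked, block_size, threshold):
    """Pick masked positions to unmask in one forward pass.

    Every block contributes its masked positions whose entropy is below
    `threshold`, so several blocks can advance in the same step. If nothing
    qualifies, the lowest-entropy masked position is chosen so that decoding
    always makes progress (an assumption of this sketch).
    """
    is_masked = np.asarray(is_masked, dtype=bool)
    entropy = token_entropy(probs)
    chosen = []
    for start in range(0, len(is_masked), block_size):
        block = np.arange(start, min(start + block_size, len(is_masked)))
        confident = block[is_masked[block] & (entropy[block] < threshold)]
        chosen.extend(confident.tolist())
    if not chosen and is_masked.any():
        masked_idx = np.flatnonzero(is_masked)
        chosen.append(int(masked_idx[np.argmin(entropy[masked_idx])]))
    return chosen


# Toy distributions over a two-token vocabulary, split into two blocks of two.
probs = [[0.99, 0.01], [0.5, 0.5], [0.95, 0.05], [0.6, 0.4]]
picked = select_positions(probs, [True] * 4, block_size=2, threshold=0.3)
```

Here `picked` contains one confident position from each block, so both blocks advance in a single forward pass.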

Together, these innovations achieve **5-10× speedup** on TPF (tokens per forward) over vanilla diffusion models while maintaining accuracy. Based on the d3LLM framework, we have released three models: d3LLM-LLaDA, d3LLM-Dream, and d3LLM-Coder.
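
For readers new to the two throughput metrics, the sketch below shows how they could be computed, assuming TPF is the number of generated tokens divided by the number of model forward passes and TPS is the number of generated tokens divided by wall-clock decoding time. The function names and the numbers are hypothetical and are not measurements from the paper.

```python
def tokens_per_forward(num_tokens: int, num_forward_passes: int) -> float:
    """TPF: generated tokens divided by the number of model forward passes."""
    if num_forward_passes <= 0:
        raise ValueError("num_forward_passes must be positive")
    return num_tokens / num_forward_passes


def tokens_per_second(num_tokens: int, elapsed_seconds: float) -> float:
    """TPS: generated tokens divided by wall-clock decoding time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_tokens / elapsed_seconds


# Hypothetical numbers, for illustration only.
one_per_step = tokens_per_forward(256, 256)  # one token per forward pass
parallel = tokens_per_forward(256, 32)       # several tokens per forward pass
tpf_speedup = parallel / one_per_step        # 8.0
```

TPF isolates the algorithmic gain from parallel decoding, while TPS also reflects hardware and per-step cost such as KV-cache reuse.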

## 🏋️‍♀️ Training d3LLM Models

We provide the training scripts for d3LLM-Dream and d3LLM-LLaDA. You can use the following commands to train the models.

```bash
# Training d3LLM-Dream
deepspeed --num_gpus=4 d3llm/d3llm_DREAM/distill_2_training/d3llm_dream_train.py

# Training d3LLM-LLaDA
deepspeed --num_gpus=4 d3llm/d3llm_LLaDA/distill_2_training/d3llm_llada_train.py
```

Before training, prepare the distillation data using the script in the `distill_1_data_prepare/` folder.


## 📊 Benchmark Results

Our d3LLM achieves the highest AUP (_Accuracy Under Parallelism_) scores among the compared diffusion LLMs (dLLMs) across multiple tasks:

<div align="center">

<table>
<tr>
<td align="center"><img src="asset/imgs/data_llada_aup_radar.png" width="100%"/><br/><b>LLaDA-based Models</b></td>
<td align="center"><img src="asset/imgs/data_dream_aup_radar.png" width="100%"/><br/><b>Dream-based Models</b></td>
<td align="center"><img src="asset/imgs/data_dream_coder_aup_radar.png" width="100%"/><br/><b>Coder Models</b></td>
</tr>
</table>

*Radar plots comparing AUP scores across different methods and benchmarks*

</div>

### Acceleration Highlights (on GSM8K-CoT Dataset)

<div align="center">

| Model | TPS (H100) | TPS (A100) | Speedup vs. AR |
|-------|:----------:|:----------:|:---------------:|
| Qwen-2.5-7B (AR) | 57.32 | 50.36 | 1.00× |
| d3LLM-LLaDA | **288.89** | **183.33** | **3.47×~5.04×** |
| d3LLM-Dream | **235.34** | **128.19** | **2.55×~4.67×** |

</div>
