# Delta Attention

NeurIPS 2025 submission.

**No commercial usage allowed.**

## Structure

- `submission/delta`: Delta implementation for SGLang
- `submission/delta-poc`: Experimental repository of Delta
- `submission/MInference`: Delta integration for MInference
- `submission/sglang`: SGLang integration of Delta
- `submission/pg19-hierarchical-qa`: Our LongQA dataset for LongPPL

## Files to check

### Paged Attention Implementation
- `hip-attention/src/hip_attn/v1_2/paged_hip.py`
### LongPPL / PPL measurements
- `hip-attention-private/scripts/long_eval_experimental.sh`
### Latency measurements
- `hip-attention-private/src/hip_research/main/bench_recompute.py`
- `MInference/test.py`

## How to run SGLang server

Python 3.10 is recommended.

1. Install `delta` (the package name is `hip_attn`).
2. Install `sglang`.
3. Run one of the server commands below, with `HIP_DELTA_ATTENTION_ARGS` set to the preset you want to evaluate.

```bash
# HIP_DELTA_ATTENTION_ARGS presets: use exactly one when launching the server.
# HiP delta
HIP_DELTA_ATTENTION_ARGS=recompute_dense-window_0-diff_1-w_64-decode_dense
# HiP recompute
HIP_DELTA_ATTENTION_ARGS=recompute_dense-window_0-diff_0-w_64-decode_dense
# HiP
HIP_DELTA_ATTENTION_ARGS=recompute_dense-window_0-diff_0-w_64-decode_dense-JUST_RETURN

# SLLM delta
HIP_DELTA_ATTENTION_ARGS=recompute_dense-window_2048-diff_1-w_64-decode_dense
# SLLM recompute
HIP_DELTA_ATTENTION_ARGS=recompute_dense-window_2048-diff_0-w_64-decode_dense
# SLLM
HIP_DELTA_ATTENTION_ARGS=recompute_dense-window_2048-diff_0-w_64-decode_dense-JUST_RETURN

# Llama 3.1 8b server command
HIP_DELTA_ATTENTION_ARGS=recompute_dense-window_2048-diff_1-w_64-decode_dense\
  HIP_DISABLE_FLASHDECODE=1\
  HIP_HEAD_REDUCE=0\
  HIP_DEBUG_LAST_DENSE=64\
  CUDA_LAUNCH_BLOCKING=0\
  HIP_DEBUG=0\
  python -m sglang.launch_server\
    --model-path meta-llama/Llama-3.1-8B-Instruct\
    --port 20000\
    --tp 2\
    --max-total-tokens 131072\
    --context-length 131072\
    --cuda-graph-bs 1\
    --max-running-req 1\
    --chunked-prefill-size -1\
    --enable-hip-attention\
    --hip-attention-config '{"using_extend": false, "dense_layers": [0, 1, 2]}'\
    --disable-radix-cache\
    --attention-backend flashinfer\
    --disable-cuda-graph

# Llama 4 Scout server command
HIP_DELTA_ATTENTION_ARGS=recompute_dense-window_0-diff_1-w_64-decode_dense\
  PASSKEY_LEN=35\
  HIP_DISABLE_FLASHDECODE=0\
  HIP_HEAD_REDUCE=0\
  HIP_DEBUG_LAST_DENSE=-1\
  python -m sglang.launch_server\
    --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct\
    --tp 8\
    --max-total-tokens 400000\
    --context-length 400000\
    --cuda-graph-bs 1\
    --max-running-req 1\
    --chunked-prefill-size -1\
    --enable-hip-attention\
    --hip-attention-config '{"using_extend": false, "dense_layers": [0, 1, 2, 3]}'\
    --attention-backend flashinfer\
    --disable-radix-cache\
    --port 33330\
    --disable-cuda-graph
```
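
The six `HIP_DELTA_ATTENTION_ARGS` presets above differ only in the `window_` value, the `diff_` value, and an optional `-JUST_RETURN` suffix. As a minimal sketch, the helper below rebuilds those strings from the parts that vary. `delta_args` and its variant names (`delta`, `recompute`, `base`) are hypothetical and not part of this repository, and only the six combinations listed above are known to be valid.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): rebuilds the preset
# strings listed above from the parts that vary between them.
#   $1 = window value (0 for the HiP presets, 2048 for the SLLM presets)
#   $2 = variant: delta | recompute | base
delta_args() {
  local window="$1" variant="$2" diff suffix=""
  case "$variant" in
    delta)     diff=1 ;;                          # "... delta" presets
    recompute) diff=0 ;;                          # "... recompute" presets
    base)      diff=0; suffix="-JUST_RETURN" ;;   # plain HiP / SLLM presets
    *) echo "unknown variant: $variant" >&2; return 1 ;;
  esac
  printf 'recompute_dense-window_%s-diff_%s-w_64-decode_dense%s\n' \
    "$window" "$diff" "$suffix"
}

# Example: select the SLLM delta preset before launching the server.
export HIP_DELTA_ATTENTION_ARGS="$(delta_args 2048 delta)"
echo "$HIP_DELTA_ATTENTION_ARGS"
```

Exporting the variable this way is an alternative to prefixing it on the `python -m sglang.launch_server` command line, which can be convenient when switching between presets in a benchmark script.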
