## TAKE: Task-Aware Chunked KV Cache Eviction for Efficient Long-Context LLM Prefill

TAKE is an inference and evaluation framework for long-context scenarios. It combines task-aware KV cache management with chunked execution to reduce memory usage while keeping output quality close to that of the unmodified model. The project provides complete evaluation pipelines for LongBench and Needle-in-a-Haystack, plus a time-to-first-token (TTFT) prefill benchmark.



## Project Structure

```
TAKE/
├── take/
│   ├── __init__.py
│   ├── log_util.py
│   └── take/
│       ├── chunk.py                              # TAKE core parameter dataclass (TakeKwargs)
│       └── transformers_take/
│           ├── llama/modeling_llama_take.py
│           └── mistral/modeling_mistral_take.py
├── eval/
│   ├── longbench/
│   │   ├── main.py                               # LongBench inference entry
│   │   ├── evaluate.py                           # LongBench metrics evaluation
│   │   ├── evaluate_mistral.py
│   │   └── config/
│   │       ├── dataset2maxlen.json
│   │       ├── model2maxlen.json
│   │       ├── dataset2tao.json                  # Recommended hyperparameters per dataset
│   │       └── dateset2prompt_task.json          # Prompt templates (with task section)
│   └── needle/
│       ├── main.py                               # Needle evaluation entry
│       ├── evaluate_mistral.py
│       ├── visualize.py                          # Result visualization
│       └── PaulGrahamEssays/*.txt                # Corpus
├── performance_benchmark/
│   └── ttft_prefill.py                           # TTFT (prefill) benchmark
├── scripts/
│   ├── evaluate_longbench.sh
│   ├── evaluate_needle.sh
│   ├── ablation_needle.sh
│   └── run_ttft.sh
├── environment.yml
└── readme.md
```

## Installation

```bash
# Create environment via Conda
conda env create -f environment.yml
conda activate take
```

For best performance, run on a CUDA GPU with FlashAttention-2 enabled.
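
To see what the active environment provides before launching a long run, a small check like the one below can help. It is a sketch and not part of this repository: `check_environment` is a hypothetical helper, and it assumes FlashAttention-2 is installed as the `flash_attn` Python package.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if the top-level module `name` is importable here."""
    return importlib.util.find_spec(name) is not None


def check_environment() -> dict:
    """Report whether PyTorch, a CUDA device, and flash_attn are available."""
    status = {
        "torch": has_module("torch"),
        "flash_attn": has_module("flash_attn"),  # assumed package name
        "cuda": False,
    }
    if status["torch"]:
        import torch  # imported lazily so the check also works without PyTorch

        status["cuda"] = bool(torch.cuda.is_available())
    return status


for component, available in check_environment().items():
    print(f"{component}: {'ok' if available else 'missing'}")
```

If `cuda` or `flash_attn` is reported as missing, the evaluation may still run, but likely without the performance this README assumes.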

## Quick Start

### LongBench Evaluation

Scripted run (replace the model path in the script first):
```bash
bash scripts/evaluate_longbench.sh
```

Single-dataset example:
```bash
python -m eval.longbench.main \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --mode take \
  --version 1.0 \
  --dataset narrativeqa \
  --longbench_type longbench \
  --chunk_size 4096 \
  --kv_budget 512 \
  --warmup_layers 16 \
  --pooling avg
```
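
To run the same configuration over several datasets, the command above can be assembled in Python. This is a sketch that uses only the flags shown in the example: `build_longbench_cmd` is a hypothetical helper, and `qasper` and `hotpotqa` are assumed to be accepted dataset names (both are LongBench tasks, but check `eval/longbench/config/` for the names this repository expects).

```python
import shlex
import sys


def build_longbench_cmd(dataset, model="meta-llama/Meta-Llama-3.1-8B-Instruct",
                        chunk_size=4096, kv_budget=512, warmup_layers=16,
                        pooling="avg"):
    """Build the argv list for one LongBench run, mirroring the example above."""
    return [
        sys.executable, "-m", "eval.longbench.main",
        "--model", model,
        "--mode", "take",
        "--version", "1.0",
        "--dataset", dataset,
        "--longbench_type", "longbench",
        "--chunk_size", str(chunk_size),
        "--kv_budget", str(kv_budget),
        "--warmup_layers", str(warmup_layers),
        "--pooling", pooling,
    ]


# Dataset names other than narrativeqa are assumptions; adjust as needed.
datasets = ["narrativeqa", "qasper", "hotpotqa"]
commands = [build_longbench_cmd(name) for name in datasets]

for cmd in commands:
    # Print the command; pass `cmd` to subprocess.run(cmd, check=True)
    # from the repository root to actually launch it.
    print(shlex.join(cmd))
```

Building an argv list rather than a shell string avoids quoting problems when a model path contains spaces.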

### Needle-in-a-Haystack Evaluation

Scripted run:
```bash
bash scripts/evaluate_needle.sh
```

Or run directly:
```bash
python -m eval.needle.main \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --mode take \
  --version 1.0 \
  --chunk_size 4096 \
  --kv_budget 512 \
  --warmup_layers 16 \
  --pooling avg
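
# Visualize the results. The path below is an example: point --eval_path at the
# directory your own run wrote, whose version and suffix depend on the flags used.
python eval/needle/visualize.py \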
  --eval_path outputs/Llama-3.1-8B-Instruct/needle/take/1.2/_cs4096_ks15_b512_tql15_wl16_sink4_cw8_alpha0.32_avg
```
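
The output directory name in the example encodes the run's hyperparameters as `<prefix><value>` tokens joined by underscores. The sketch below reads them back from such a name. `parse_run_dir` is a hypothetical helper, not part of the repository, and the meaning of each prefix is not documented here: `cs`, `b`, and `wl` appear to match `chunk_size`, `kv_budget`, and `warmup_layers` by value, but that mapping is a guess.

```python
import re
from pathlib import PurePosixPath

# A token is letters followed by an integer or decimal number, e.g. "cs4096".
_TOKEN = re.compile(r"^([A-Za-z]+)(\d+(?:\.\d+)?)$")


def parse_run_dir(path: str) -> dict:
    """Split the last path component into {prefix: value} plus bare flags."""
    name = PurePosixPath(path).name
    params, flags = {}, []
    for token in filter(None, name.split("_")):
        match = _TOKEN.match(token)
        if match:
            key, raw = match.groups()
            params[key] = float(raw) if "." in raw else int(raw)
        else:
            flags.append(token)  # tokens without a numeric part, e.g. "avg"
    params["flags"] = flags
    return params


example = ("outputs/Llama-3.1-8B-Instruct/needle/take/1.2/"
           "_cs4096_ks15_b512_tql15_wl16_sink4_cw8_alpha0.32_avg")
print(parse_run_dir(example))
```

This is handy for labeling plots or grouping ablation runs by setting without keeping a separate record of the flags.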

### Performance Benchmark: TTFT Prefill

```bash
bash scripts/run_ttft.sh
```

Or run directly:
```bash
python performance_benchmark/ttft_prefill.py \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --mode take \
  --version 1.0 \
  --seqlen 131072 \
  --kv_budget 512 \
  --chunk_size 4096 \
  --warmup_layers 16 \
  --test_performance True
```

The script prints the mean TTFT (computed after excluding the minimum and maximum measurements) and the peak memory usage, and saves a JSON report to the current directory.
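
Dropping the extremes before averaging makes the reported TTFT less sensitive to a slow first run or a one-off stall. The sketch below shows one way to compute such a statistic. It is an illustration, not the code in `ttft_prefill.py`: it assumes exactly one minimum and one maximum are removed, and the sample timings are made up.

```python
import time


def time_calls(fn, repeats):
    """Time `fn` `repeats` times and return the durations in seconds.

    For GPU work, the device must be synchronized before each clock read,
    otherwise only the kernel launch time is measured.
    """
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def trimmed_mean(samples):
    """Mean of `samples` after removing one minimum and one maximum."""
    if len(samples) < 3:
        raise ValueError("need at least 3 samples to drop min and max")
    kept = sorted(samples)[1:-1]
    return sum(kept) / len(kept)


# Made-up timings in seconds: the 50.0 outlier does not affect the result.
ttft_seconds = [1.0, 2.0, 3.0, 4.0, 50.0]
print(trimmed_mean(ttft_seconds))
```

With the made-up timings above, the plain mean would be 12.0 seconds, while the trimmed mean reflects the typical runs.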

## Key Parameters (Common)

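- `kv_budget`: Overall KV cache budget (512 in the examples above). Smaller values use less memory.
- `kv_warmup_budget`: KV budget applied during warmup, for the initial tokens/layers.
- `chunk_size`: Size of each chunk processed during chunked execution (4096 in the examples above).
- `warmup_layers`: Early layers in which a more complete KV cache is kept, for stability (16 in the examples above).
- `alpha`: Task accumulation/intensity coefficient.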
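
To make the interaction between `chunk_size` and `kv_budget` concrete, here is a toy sketch in plain Python. It is not TAKE's algorithm: the per-token scores are a stand-in for whatever importance signal the real method uses, and task awareness, `warmup_layers`, `kv_warmup_budget`, and `alpha` are not modeled. It only illustrates that evicting down to the budget after each chunk bounds the cache at `kv_budget + chunk_size` entries.

```python
def chunked_prefill_with_eviction(scores, chunk_size, kv_budget):
    """Toy model of chunked prefill with score-based eviction.

    `scores[i]` is a hypothetical importance score for token i.
    Returns the retained token positions and the peak cache size.
    """
    cache = []  # list of (position, score) pairs
    peak = 0
    for start in range(0, len(scores), chunk_size):
        end = min(start + chunk_size, len(scores))
        cache.extend((pos, scores[pos]) for pos in range(start, end))
        peak = max(peak, len(cache))
        if len(cache) > kv_budget:
            # Keep the highest-scoring entries, then restore positional order.
            top = sorted(cache, key=lambda entry: entry[1], reverse=True)
            cache = sorted(top[:kv_budget])
    return [pos for pos, _ in cache], peak


# Deterministic made-up scores for a 64-token sequence.
toy_scores = [(i * 37) % 101 for i in range(64)]
kept, peak = chunked_prefill_with_eviction(toy_scores, chunk_size=16, kv_budget=8)
print(f"kept {len(kept)} positions, peak cache size {peak}")
```

In this toy model the peak cache size depends on `chunk_size` and `kv_budget` rather than on the sequence length, which is the intuition behind the memory savings.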


## Reproducibility

- LongBench:
  ```bash
  bash scripts/evaluate_longbench.sh
  ```

- Needle:
  ```bash
  bash scripts/evaluate_needle.sh
  ```

- TTFT Prefill:
  ```bash
  bash scripts/run_ttft.sh
  ```


## Citation

If this project is useful for your research, please consider citing it. A BibTeX entry will be provided after the paper is published.
