<div align="center">
<h1> Grouped-head latenT Attention </h1>
</div>

## 🚀 Overview

Attention mechanisms drive LLM success but create computational and memory bottlenecks: with standard attention, compute grows quadratically with sequence length and the KV cache grows linearly. We observe substantial redundancy in attention: the KV cache can be compressed significantly, and attention maps across heads show high similarity.


## ⚙️ Training Details
To reproduce the training curve and performance, you can use the `run_clm.py` script provided by Hugging Face. The exact training hyperparameters are as follows:

```bash

# content of run.sh

DISTRIBUTED_ARGS="
	--nproc_per_node $GPUS \
	--nnodes $SLURM_NNODES \
	--node_rank $SLURM_NODEID \
	--rdzv_endpoint $ADDR:$PORT \
	--rdzv_conf=join_timeout=36000000,read_timeout=3600000,timeout=36000000 \
    "


eval_options=" \
	--per_device_eval_batch_size $EVAL_BS \
	--do_eval \
	--evaluation_strategy steps \
	--max_eval_samples $MAX_EVAL_SAMPLE  \
	--eval_steps $EVAL_STEP "


clm_options=" \
	--train_file $DATA \
	--trust_remote_code true \
	--experiment_id $DATE \
	--report_to wandb \
	--block_size $BLOCK_SIZE \
	--preprocessing_num_workers 64 \
	--dataloader_num_workers 10 \
	--learning_rate $LR \
	--logging_steps 1 \
	--num_train_epochs $EPOCH \
	--bf16 true \
	--config_name $CONFIG \
	--tokenizer_name $CONFIG \
	--model_type $MODEL_TYPE \
	--per_device_train_batch_size $MICRO_BATCH \
	--gradient_accumulation_steps $BATCH_ACC \
	--optim adamw_hf \
	--lr_scheduler_type cosine \
	--warmup_ratio $WARM_RATIO \
	--gradient_checkpointing true \
	--save_strategy steps \
	--save_steps $SAVE_STEP \
	--deepspeed $DEEPSPEED \
	--overwrite_output_dir \
	--output_dir $SAVED_PRETRAIN_CHECKPOINT_PATH \
	--cache_dir $CACHE \
	--do_train "

SCRIPTS="run_clm_run.py"
run_cmd="torchrun $DISTRIBUTED_ARGS $SCRIPTS ${clm_options} ${eval_options}"

echo ${run_cmd}
eval ${run_cmd}

```
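
The script above leaves `$MICRO_BATCH`, `$BATCH_ACC`, `$GPUS`, and `$SLURM_NNODES` to the caller, so it can help to check the resulting global batch size before launching. The following is a minimal sketch, assuming plain data parallelism across all GPUs. The helper name `global_batch_size` and the example values are hypothetical and are not the settings used for the reported results.

```bash
# Hypothetical helper, not part of the original run.sh.
# Sequences per optimizer step under plain data parallelism:
#   per-device micro batch x gradient accumulation steps x GPUs per node x nodes
global_batch_size() {
    local micro_batch=$1 batch_acc=$2 gpus=$3 nnodes=$4
    echo $(( micro_batch * batch_acc * gpus * nnodes ))
}

# Placeholder values for illustration only; substitute your own settings.
global_batch_size 4 8 8 2
```

Multiplying the result by `$BLOCK_SIZE` gives the approximate number of tokens consumed per optimizer step.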

## 🛠️ LLM Viewer Analysis

We provide a comprehensive analysis framework based on [LLM Viewer] to evaluate the computational efficiency of different attention mechanisms. Our modified implementation is available in the `llm_viewer/` directory.

### Usage
To reproduce the efficiency analysis results:

```bash
cd llm_viewer
bash cal.sh
```


## 🔮 Future Work

- [ ] **Scaled Training Data Models**: Release GTA models trained on larger datasets via Hugging Face to demonstrate performance at scale
- [ ] **Multi-Scale Model Family**: Deploy GTA models across different parameter scales (3B, 7B, 13B) on Hugging Face for comprehensive evaluation
- [ ] **Efficient Inference Implementations**: Develop optimized GTA implementations for llama.cpp and vLLM to enable high-performance deployment



