# QKV Projections Require a Fraction of Their Memory
## PAMM: Point Approximate Matrix Multiplication


### Install experiment dependencies

```bash
pip install -e .
pip install -r requirements.txt
```

## Benchmark 1: Pre-Training LLaMA on the C4 dataset

```bash
# LLaMA-60M
torchrun torchrun_main.py \
    --model_config configs/llama_60m.json \
    --lr 1e-2 \
    --group llama60 \
    --batch_size 256 \
    --total_batch_size 512 \
    --num_training_steps 10000 \
    --warmup_steps 1000 \
    --weight_decay 0.0 \
    --dtype bfloat16 \
    --eval_every 1000 \
    --optimizer adam \
    --n_seeds 1 \
    --memory_efficient \
    --rank 0.00390625 \
    --proj_type pamm \
    --update_proj_gap 1 \
    --name llama-60m-rank-1_256 \
    --scale 0.25 \
    --scale_o_proj 4 \
    --single_gpu
```
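The `--rank 0.00390625` value used in these commands is exactly 1/256, which matches the `rank-1_256` suffix in `--name`. To try a different ratio, a small helper like the one below can produce the decimal to pass to `--rank`. This is only a convenience sketch: `rank_fraction` is a hypothetical name that is not part of this repository, and it assumes `awk` is available.

```bash
# Hypothetical helper (not part of this repo): print 1/N as a decimal
# suitable for passing to --rank.
rank_fraction() {
    awk -v d="$1" 'BEGIN { printf "%.10g\n", 1 / d }'
}

rank_fraction 256   # prints 0.00390625
rank_fraction 512   # prints 0.001953125
```

When changing the ratio, it may help to update the `--name` suffix to match so that runs stay easy to tell apart.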

```bash
# LLaMA-1B
torchrun --nproc_per_node=2 torchrun_main.py \
    --model_config configs/llama_1b.json \
    --lr 3e-3 \
    --group llama1b \
    --batch_size 64 \
    --total_batch_size 512 \
    --num_training_steps 100000 \
    --warmup_steps 10000 \
    --weight_decay 0.0 \
    --dtype bfloat16 \
    --eval_every 1000 \
    --optimizer adam \
    --n_seeds 1 \
    --memory_efficient \
    --rank 0.00390625 \
    --proj_type pamm \
    --update_proj_gap 1 \
    --name llama-1b-rank-1_256 \
    --scale 0.25 \
    --scale_o_proj 4
```
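Both pre-training commands set `--total_batch_size 512` but use different `--batch_size` values and GPU counts. Assuming `--batch_size` is the per-GPU batch and the script reaches the total through gradient accumulation (an assumption about `torchrun_main.py`, not something stated here), the implied number of accumulation steps can be checked with a sketch like this. `accum_steps` is a hypothetical helper, not part of the repository.

```bash
# Hypothetical helper (not part of this repo). Assumes:
#   total_batch_size = batch_size * num_gpus * accumulation_steps
accum_steps() {
    local total=$1 per_device=$2 num_gpus=$3
    local per_step=$(( per_device * num_gpus ))
    if (( total % per_step != 0 )); then
        echo "total batch size must be divisible by batch_size * num_gpus" >&2
        return 1
    fi
    echo $(( total / per_step ))
}

accum_steps 512 256 1   # LLaMA-60M settings: prints 2
accum_steps 512 64 2    # LLaMA-1B settings:  prints 4
```

Under the same assumption, changing the number of GPUs would call for adjusting `--batch_size` so that the product still divides `--total_batch_size` evenly.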

## Benchmark 2: Fine-Tuning RoBERTa on GLUE tasks

```bash
python run_glue.py \
    --model_name_or_path roberta-base \
    --n_seeds 3 \
    --group finetuning_roberta-base \
    --max_length 512 \
    --per_device_train_batch_size 16 \
    --num_train_epochs 30 \
    --report_to wandb \
    --with_tracking \
    --checkpointing_steps 10000 \
    --output_dir results/ft/m_roberta_base \
    --proj_type pamm \
    --memory_efficient \
    --update_proj_gap 1 \
    --rank 0.00390625 \
    --scale 2 \
    --name mnli_lr_1e-5_rank_1_256 \
    --task_name mnli \
    --lr 1e-5
```