# QuaRot Fixes and Demo

This branch fixes compatibility issues with newer versions of the `transformers` library and demonstrates QuaRot quantization running end to end on LLaMA-2-7B.

## Main Fixes

### 1. Attention Mask Shape Compatibility (`fake_quant/gptq_utils.py` and `fake_quant/eval_utils.py`)
- Fixed attention mask shape mismatches when using transformers >= 4.38.0
- Added proper batch size and sequence length handling
- Handles both 2D and 4D attention masks correctly
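The kind of mask normalization described above can be sketched as follows. This is a minimal illustration, not the code in this branch: the helper name `normalize_attention_mask`, its parameters, and the choice of an additive causal mask are all assumptions made for the example.

```python
import torch


def normalize_attention_mask(mask, batch_size, seq_len, dtype=torch.float32):
    """Hypothetical helper: return an additive mask of shape
    (batch_size, 1, seq_len, seq_len), or None if no mask was given."""
    if mask is None:
        return None

    if mask.dim() == 2:
        # 2D padding mask (batch, seq): 1 = attend, 0 = padding.
        keep = mask[:, :seq_len].to(torch.bool)[:, None, None, :]
        causal = torch.tril(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=mask.device)
        )
        keep = keep & causal  # broadcasts to (batch, 1, seq, seq)
        mask = torch.zeros(keep.shape, dtype=dtype, device=mask.device)
        mask = mask.masked_fill(~keep, torch.finfo(dtype).min)
    elif mask.dim() == 4:
        # 4D mask: crop to the current sequence length.
        mask = mask[..., :seq_len, :seq_len]
    else:
        raise ValueError(f"unsupported attention mask rank: {mask.dim()}")

    if mask.shape[0] != batch_size:
        if mask.shape[0] != 1:
            raise ValueError("mask batch size does not match the input batch")
        # Share a single mask across the whole batch without copying.
        mask = mask.expand(batch_size, -1, -1, -1)
    return mask
```

For instance, a captured mask of shape `(1, 1, 2048, 2048)` passed with `batch_size=4, seq_len=2048` comes back as a `(4, 1, 2048, 2048)` view.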

### 2. Attribute Access Error (`fake_quant/quant_utils.py`)
- Fixed `cos_cached` attribute error in newer transformers versions
- Added try-except block to handle deprecated attributes gracefully
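A try-except guard of this kind might look like the sketch below. The helper `get_cached_cos` and the two stand-in classes are hypothetical and exist only to illustrate the pattern; the actual code in `fake_quant/quant_utils.py` may differ.

```python
def get_cached_cos(rotary_emb):
    """Hypothetical helper: return `cos_cached` if the rotary embedding
    still exposes it, otherwise None so the caller can fall back."""
    try:
        return rotary_emb.cos_cached
    except AttributeError:
        # The attribute is not available on this transformers version.
        return None


class OldRotary:
    """Stand-in for a rotary embedding that still has `cos_cached`."""
    cos_cached = [1.0, 0.5]


class NewRotary:
    """Stand-in for a rotary embedding without `cos_cached`."""


print(get_cached_cos(OldRotary()))  # [1.0, 0.5]
print(get_cached_cos(NewRotary()))  # None
```

Returning `None` rather than raising lets the caller decide how to recompute or skip the cached values.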

## Demo Results

We successfully ran QuaRot on LLaMA-2-7B with the following results:

| Configuration | WikiText-2 PPL | Memory Reduction |
|--------------|----------------|------------------|
| Baseline (FP16) | 5.810 | - |
| W4A16K4V4 + Rotation | 5.757 | 75% |
| W4A4K4V4 + Rotation | 6.369 | 75% + activation savings |
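As a rough check on the table, the snippet below reproduces the 75% figure and the relative perplexity change. It assumes that "Memory Reduction" refers to nominal weight storage (4-bit versus 16-bit) and ignores overhead such as quantization scales; the function names are illustrative only.

```python
def weight_memory_reduction(bits, baseline_bits=16):
    """Nominal fraction of weight storage saved versus the baseline width."""
    return 1.0 - bits / baseline_bits


def relative_ppl_change(ppl, baseline_ppl):
    """Relative perplexity change versus the baseline (positive = worse)."""
    return (ppl - baseline_ppl) / baseline_ppl


print(f"{weight_memory_reduction(4):.0%}")        # 75%
print(f"{relative_ppl_change(6.369, 5.810):.1%}")  # 9.6%
```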

## How to Run

1. Ensure you have the conda environment activated:
   ```bash
   conda activate quarot
   ```

2. Run the fake quantization demo:
   ```bash
   cd fake_quant
   
   # For W4A16K4V4 (recommended):
   python main.py --model meta-llama/Llama-2-7b-hf \
     --rotate --a_bits 16 --v_bits 4 --k_bits 4 --w_bits 4 --w_clip
   
   # For full W4A4K4V4:
   python main.py --model meta-llama/Llama-2-7b-hf \
     --rotate --a_bits 4 --v_bits 4 --k_bits 4 --w_bits 4 --w_clip
   ```

## Files Modified

- `fake_quant/gptq_utils.py`: Fixed attention mask handling in GPTQ quantization
- `fake_quant/eval_utils.py`: Fixed attention mask handling in evaluation
- `fake_quant/quant_utils.py`: Fixed attribute access for newer transformers

## Additional Files

- `demo_results.py`: Summary of our experimental results
- `CLAUDE.md`: Project instructions for the Claude Code assistant

## Environment

- Python 3.10
- PyTorch 2.2.1 with CUDA 11.8
- Transformers 4.38.0
- System CUDA 12.8 (backward compatible with the CUDA 11.8 PyTorch build)
- GPUs: 2x NVIDIA H100 80GB

## Notes

- The fixes target compatibility with transformers 4.38.0 and later
- All modifications preserve the original QuaRot algorithm
- On WikiText-2, W4A16K4V4 stays close to the FP16 baseline (5.757 vs. 5.810 PPL), while full W4A4K4V4 raises perplexity to 6.369