# Dictionary Training for SparseCache

This module provides dictionary learning capabilities for extreme KV cache compression via sparse coding over universal dictionaries. The implementation supports training layer-specific KV dictionaries through direct gradient-based optimization.

**Based on**: This code is based on the [Lexico codebase](https://github.com/krafton-ai/lexico) by KRAFTON AI, which implements the research paper "Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries" ([arXiv:2412.08890](https://arxiv.org/abs/2412.08890)).

## Table of Contents

- [Overview](#overview)
- [Features](#features)
- [Project Structure](#project-structure)
- [Installation](#installation)
- [Usage](#usage)
- [Configuration](#configuration)
- [Output Structure](#output-structure)
- [Monitoring Training](#monitoring-training)

## Overview

This dictionary training system learns sparse representations of key-value (KV) cache data from large language models. It uses an autoencoder architecture with Orthogonal Matching Pursuit (OMP) for sparse encoding to compress KV cache data while maintaining reconstruction quality.

### Key Components

- **Autoencoder Model**: Learns dictionary representations with configurable sparsity
- **OMP Encoder**: Provides sparse coding with orthogonal matching pursuit
- **Configuration Management**: Flexible path and parameter configuration
- **Training Pipeline**: Comprehensive training with validation and logging
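
To make the sparse-coding step concrete, here is a minimal NumPy sketch of textbook Orthogonal Matching Pursuit. It is not the code in `omp.py`: the function name `omp_encode`, the column-wise dictionary layout, and the unit-norm assumption are illustrative choices, and the real implementation is presumably batched and GPU-based.

```python
import numpy as np


def omp_encode(dictionary, x, sparsity):
    """Greedy OMP: approximate x with at most `sparsity` columns of `dictionary`.

    Assumes `dictionary` has shape (feature_dim, num_atoms) with unit-norm columns.
    Returns (indices, coefficients) with dictionary[:, indices] @ coefficients ~ x.
    """
    residual = x.astype(np.float64).copy()
    indices = []
    coeffs = np.zeros(0)
    for _ in range(sparsity):
        # Pick the atom most correlated with the current residual.
        correlations = np.abs(dictionary.T @ residual)
        if indices:
            correlations[indices] = -np.inf  # never pick the same atom twice
        indices.append(int(np.argmax(correlations)))
        # Re-fit all selected coefficients jointly (the "orthogonal" step).
        selected = dictionary[:, indices]
        coeffs, *_ = np.linalg.lstsq(selected, x, rcond=None)
        residual = x - selected @ coeffs
        if np.linalg.norm(residual) < 1e-10:
            break
    return np.array(indices), coeffs


rng = np.random.default_rng(0)
D = rng.standard_normal((32, 128))
D /= np.linalg.norm(D, axis=0)          # normalize atoms to unit length
x = 2.0 * D[:, 5] - 1.5 * D[:, 40]      # toy signal built from two atoms
idx, c = omp_encode(D, x, sparsity=8)
x_hat = D[:, idx] @ c
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(len(idx), rel_err)
```

Only the selected indices and their coefficients need to be stored per vector, which is where the compression comes from.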

## Features

- **Sparse Dictionary Learning** with configurable sparsity levels
- **Comprehensive Logging** with TensorBoard integration
- **Flexible Configuration** with command-line parameter support
- **Configurable Paths** for different deployment environments
- **Train/Test Split** with proper data separation
- **Checkpoint Management** with epoch-wise dictionary saving
- **GPU Support** with automatic device detection

## Project Structure

```
dictionary_training/
├── README.md                 # This file
├── requirements.txt          # Python dependencies
├── setup.py                 # Package setup configuration
├── train.sh                 # Shell entry point for the default training run
├── sparsecache/                 # Core implementation
│   ├── __init__.py
│   ├── omp.py              # Orthogonal Matching Pursuit implementation
│   └── dictionary_learning/
│       ├── __init__.py
│       ├── train.py        # Main training script
│       ├── model.py        # Autoencoder model definitions
│       ├── utils.py        # Data loading utilities
│       ├── config.py       # Configuration management
│       └── path_utils.py   # Path handling utilities
├── checkpoints/            # Model checkpoints (created during training)
├── dictionaries_s{N}/      # Saved dictionaries by sparsity level
├── runs_s{N}/             # TensorBoard logs by sparsity level
└── lexico.egg-info/       # Package metadata
```

## Installation

### Setup

1. **Clone or navigate to the dictionary_training directory**:
   ```bash
   cd path/to/TurboRAG/dictionary_training
   ```

2. **Install dependencies**:
   ```bash
   pip install -r requirements.txt
   ```

3. **Install the package** (optional, for development):
   ```bash
   pip install -e .
   ```

## Usage

### Quick Start

The easiest way to start training is using the provided shell script:

```bash
# Run the default training configuration
bash train.sh
```

### Manual Training

For more control, run the training script directly:

```bash
python sparsecache/dictionary_learning/train.py \
    --model_name_or_path "Qwen2.5-7B-Instruct_1M_wiki3M" \
    --dictionary_size 8192 \
    --sparsity 64 \
    --concat 1 \
    --num_epochs 30 \
    --batch_size 2048 \
    --lr 0.0005
```

### Key Parameters

| Parameter | Description | Default | Example |
|-----------|-------------|---------|---------|
| `--model_name_or_path` | Model identifier for data loading | **Required** | `"Qwen2.5-7B-Instruct_1M_wiki3M"` |
| `--dictionary_size` | Number of atoms (`N`) in the learned dictionary | `4096` | `8192` |
| `--sparsity` | Maximum number of nonzero coefficients (`s`) per encoded vector | `8` | `64` |
| `--num_epochs` | Number of training epochs | `20` | `30` |
| `--batch_size` | Training batch size | `64` | `2048` |
| `--lr` | Learning rate | `0.0001` | `0.0005` |
| `--concat` | Number of concatenated layers | `1` | `2` |
| `--use_norm` | Apply normalization to training loss | `False` | `--use_norm` |
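
As a reading aid, the table can be mirrored in a small `argparse` sketch. This is a hypothetical reconstruction from the flags and defaults listed above, not the parser in `train.py`, which may define more options or different types.

```python
import argparse


def build_parser():
    # Hypothetical parser rebuilt from the parameter table; defaults are
    # copied from the table, everything else is an assumption.
    parser = argparse.ArgumentParser(description="Dictionary training (sketch)")
    parser.add_argument("--model_name_or_path", required=True)
    parser.add_argument("--dictionary_size", type=int, default=4096)
    parser.add_argument("--sparsity", type=int, default=8)
    parser.add_argument("--num_epochs", type=int, default=20)
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--lr", type=float, default=1e-4)
    parser.add_argument("--concat", type=int, default=1)
    # Boolean switch: absent means False, present means True.
    parser.add_argument("--use_norm", action="store_true")
    return parser


args = build_parser().parse_args(
    ["--model_name_or_path", "Qwen2.5-7B-Instruct_1M_wiki3M", "--sparsity", "64", "--use_norm"]
)
print(args.dictionary_size, args.sparsity, args.use_norm)  # 4096 64 True
```

Flags that are omitted fall back to the defaults in the table, so only `--model_name_or_path` has to be passed explicitly.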

### Path Configuration (Optional)

You can customize where outputs are saved:

```bash
python sparsecache/dictionary_learning/train.py \
    --model_name_or_path "Qwen2.5-7B-Instruct_1M_wiki3M" \
    --dictionary_size 8192 \
    --sparsity 64 \
    --checkpoint_dir "./my_checkpoints" \
    --dictionary_dir "./my_dictionaries" \
    --runs_dir "./my_tensorboard_logs" \
    --data_base_dir "/custom/data/path"
```

## Configuration

### Data Requirements

The training system expects preprocessed KV data files in the following format:

- **Key files**: `*_key_*.pt` (PyTorch tensor files)
- **Value files**: `*_value_*.pt` (PyTorch tensor files)
- **Location**: Default `/data/llm/tmp` (configurable via `--data_base_dir`)

**Note**: Instructions for dataset preparation are not included in this version. Please ensure your data files follow the expected naming convention and format.
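
Before training, it can help to check that the data directory actually contains files matching these patterns. The sketch below uses only the standard library; the helper name `find_kv_files`, the non-recursive search, and the demo file names are assumptions, since the README does not say how the loader in `utils.py` walks the directory.

```python
import tempfile
from pathlib import Path


def find_kv_files(data_dir):
    """Collect key and value tensor files by the naming convention above.

    Assumption: files sit directly in data_dir (no recursion into subfolders).
    """
    root = Path(data_dir)
    key_files = sorted(root.glob("*_key_*.pt"))
    value_files = sorted(root.glob("*_value_*.pt"))
    return key_files, value_files


# Demo on a throwaway directory; the file names are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("layer0_key_0.pt", "layer0_value_0.pt", "layer1_key_0.pt", "notes.txt"):
        (Path(tmp) / name).touch()
    key_files, value_files = find_kv_files(tmp)
    key_names = [p.name for p in key_files]
    value_names = [p.name for p in value_files]

print(key_names)    # ['layer0_key_0.pt', 'layer1_key_0.pt']
print(value_names)  # ['layer0_value_0.pt']
```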

### Environment Variables

You can also configure paths using environment variables:

```bash
export DICTIONARY_CHECKPOINT_DIR="/path/to/checkpoints"
export DICTIONARY_DATA_DIR="/path/to/data"
# Then run training normally
bash train.sh
```
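
One plausible way to combine command-line flags, environment variables, and built-in defaults is sketched below. The precedence order (flag, then environment variable, then default), the helper name `resolve_dir`, and the `./checkpoints` default are assumptions; check `config.py` and `path_utils.py` for the actual behavior.

```python
import os


def resolve_dir(cli_value, env_var, default):
    """Assumed precedence: explicit CLI value, then environment variable, then default."""
    if cli_value:
        return cli_value
    return os.environ.get(env_var) or default


# Simulate the exports shown above for one variable and leave the other unset.
os.environ["DICTIONARY_DATA_DIR"] = "/path/to/data"
os.environ.pop("DICTIONARY_CHECKPOINT_DIR", None)

data_dir = resolve_dir(None, "DICTIONARY_DATA_DIR", "/data/llm/tmp")
ckpt_dir = resolve_dir(None, "DICTIONARY_CHECKPOINT_DIR", "./checkpoints")
override = resolve_dir("/custom/data/path", "DICTIONARY_DATA_DIR", "/data/llm/tmp")

print(data_dir)   # /path/to/data
print(ckpt_dir)   # ./checkpoints
print(override)   # /custom/data/path
```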

### Training Configuration Examples

**Small-scale testing**:
```bash
python sparsecache/dictionary_learning/train.py \
    --model_name_or_path "Qwen2.5-7B-Instruct_1M_wiki3M" \
    --dictionary_size 1024 \
    --sparsity 16 \
    --num_epochs 5 \
    --batch_size 256
```

**Large-scale production**:
```bash
python sparsecache/dictionary_learning/train.py \
    --model_name_or_path "Qwen2.5-7B-Instruct_1M_wiki3M" \
    --dictionary_size 16384 \
    --sparsity 128 \
    --num_epochs 50 \
    --batch_size 4096 \
    --lr 0.001
```

## Output Structure

After training, you'll find the following outputs:

### Dictionaries
```
dictionaries_s{sparsity}/
├── {model_name}_N_{dict_size}_s_{sparsity}_f_{feature_dim}.pt
└── {model_name}_N_{dict_size}_s_{sparsity}_f_{feature_dim}_{epoch}epoch.pt
```

### Checkpoints
```
checkpoints/
└── {model_name}_N_{dict_size}_s_{sparsity}_f_{feature_dim}.pt
```

### TensorBoard Logs
```
runs_s{sparsity}/
└── {model_name}_N_{dict_size}_s_{sparsity}_f_{feature_dim}/
    └── events.out.tfevents.*
```

### Example Output Files

For a typical training run, you might see:
```
dictionaries_s64/
├── Qwen2.5-7B-Instruct_1M_wiki3M_N_8192_s_64_f_512.pt
├── Qwen2.5-7B-Instruct_1M_wiki3M_N_8192_s_64_f_512_0epoch.pt
├── Qwen2.5-7B-Instruct_1M_wiki3M_N_8192_s_64_f_512_1epoch.pt
└── ...

runs_s64/
└── Qwen2.5-7B-Instruct_1M_wiki3M_N_8192_s_64_f_512/
    └── events.out.tfevents.{timestamp}
```
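
When a downstream script needs to locate a saved dictionary, the naming pattern above can be reproduced with a small helper. The functions `run_name` and `dictionary_path` are illustrative only and are not part of the package; they simply format the pattern documented in this section.

```python
from pathlib import PurePosixPath


def run_name(model_name, dict_size, sparsity, feature_dim):
    # Shared stem used by dictionaries, checkpoints, and TensorBoard runs.
    return f"{model_name}_N_{dict_size}_s_{sparsity}_f_{feature_dim}"


def dictionary_path(model_name, dict_size, sparsity, feature_dim, epoch=None):
    # epoch=None gives the final dictionary; an integer gives the per-epoch snapshot.
    suffix = "" if epoch is None else f"_{epoch}epoch"
    stem = run_name(model_name, dict_size, sparsity, feature_dim)
    return PurePosixPath(f"dictionaries_s{sparsity}") / f"{stem}{suffix}.pt"


final_path = dictionary_path("Qwen2.5-7B-Instruct_1M_wiki3M", 8192, 64, 512)
epoch0_path = dictionary_path("Qwen2.5-7B-Instruct_1M_wiki3M", 8192, 64, 512, epoch=0)
print(final_path)   # dictionaries_s64/Qwen2.5-7B-Instruct_1M_wiki3M_N_8192_s_64_f_512.pt
print(epoch0_path)  # dictionaries_s64/Qwen2.5-7B-Instruct_1M_wiki3M_N_8192_s_64_f_512_0epoch.pt
```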

## Monitoring Training

### TensorBoard

Launch TensorBoard to monitor training progress:

```bash
# Monitor specific sparsity level
tensorboard --logdir runs_s64

# Monitor all runs
tensorboard --logdir .
```

### Metrics Tracked

- **Loss/train**: Training reconstruction loss
- **Loss/test**: Test reconstruction loss
- **Loss/epoch_train**: Average training loss per epoch
- **Loss/epoch_test**: Average test loss per epoch
- **RelativeReconstructionError/train**: Training reconstruction error
- **RelativeReconstructionError/test**: Test reconstruction error
- **RelativeReconstructionError/epoch_train**: Average training error per epoch
- **RelativeReconstructionError/epoch_test**: Average test error per epoch
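
This README does not spell out how the relative reconstruction error is computed. A common definition, assumed here, is the L2 norm of the reconstruction residual divided by the L2 norm of the original vector, averaged over the batch. The function name and the `eps` guard are illustrative.

```python
import numpy as np


def relative_reconstruction_error(x, x_hat, eps=1e-12):
    """Mean of ||x - x_hat|| / ||x|| over the rows of a batch (assumed definition)."""
    num = np.linalg.norm(x - x_hat, axis=-1)
    den = np.linalg.norm(x, axis=-1)
    return float(np.mean(num / (den + eps)))  # eps avoids division by zero


x = np.array([[3.0, 4.0], [0.0, 2.0]])
x_hat = np.array([[3.0, 0.0], [0.0, 2.0]])
# Row 1: residual norm 4 over signal norm 5 = 0.8; row 2: exact, so 0.0.
err = relative_reconstruction_error(x, x_hat)
print(round(err, 6))  # 0.4
```

Under this definition a value of 0 means perfect reconstruction, and a value of 1 is what an all-zero reconstruction would score.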

## Acknowledgments

This implementation is based on the [Lexico codebase](https://github.com/krafton-ai/lexico) developed by KRAFTON AI. We acknowledge their contributions to the field of KV cache compression and sparse coding research.

### Original Research

**Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries**
- Authors: Junhyuck Kim, Jongho Park, Jaewoong Cho, Dimitris Papailiopoulos
- Paper: [arXiv:2412.08890](https://arxiv.org/abs/2412.08890)
- Repository: [github.com/krafton-ai/lexico](https://github.com/krafton-ai/lexico)

---

**Last Updated**: January 2025

**Version**: 1.0.0
