# Layerwise GPTQ / ZSIC Quantization Pipeline

This repo implements layerwise quantization for LLMs with support for:

- **GPTQ** and **ZSIC** quantization methods
- **Qronos mode** for ZSIC - minimizes E[(WX - ŴX̂)²] using cross-covariance statistics
- **Binary search** for precise rate targeting
- **Global rate control** with per-weight-type budget multipliers
- **Hadamard rotation** for improved quantization (row, column, or both)
- **Resume support** for long-running jobs

---

## Installation

### 1. Clone with submodules

```bash
git clone --recurse-submodules https://github.com/your-repo/w-quant-new.git
cd w-quant-new
```

Or if already cloned:
```bash
git submodule update --init --recursive
```

### 2. Install dependencies

```bash
pip install torch numpy GPUtil matplotlib
```

### 3. Install fast-hadamard-transform (optional, for `--hadamard`)

Required only if you plan to use Hadamard rotation (`--hadamard` flag).

```bash
cd fast-hadamard-transform
pip install -v .
cd ..
```

This installs a CUDA-accelerated Hadamard transform that supports fp32, fp16, and bf16 for dimensions up to 32768.

---

## Quick Start

### ZSIC (recommended)

```bash
CUDA_VISIBLE_DEVICES=0 torchrun --standalone --nproc_per_node=1 \
  -m scripts.run_pipeline_job \
  --model "3-8B" \
  --method zsic \
  --target_rate 3 \
  --layer_end 32 \
  --hessian_batch_size 10 \
  --zsic_binary_search \
  --rate_control
```

### GPTQ

```bash
CUDA_VISIBLE_DEVICES=0 torchrun --standalone --nproc_per_node=1 \
  -m scripts.run_pipeline_job \
  --model "3-8B" \
  --method gptq \
  --target_rate 4 \
  --layer_end 32 \
  --hessian_batch_size 16 \
  --percdamp 0.1
```

---

## Key Features

### GPTQ

GPTQ uses fixed-point quantization with `target_rate` controlling the number of quantization levels:
- `maxq = 2^(target_rate+1) - 1`
- `target_rate=4` → maxq=31 (32 levels)
- `target_rate=3` → maxq=15 (16 levels)
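
As a quick sanity check of the formula above, the mapping from `target_rate` to `maxq` can be reproduced in a few lines. The helper name `gptq_maxq` is hypothetical and only mirrors the stated formula; it is not taken from `yp_compress.py`.

```python
def gptq_maxq(target_rate: int) -> int:
    """Largest integer code, following maxq = 2^(target_rate+1) - 1."""
    return 2 ** (target_rate + 1) - 1


for rate in (3, 4):
    maxq = gptq_maxq(rate)
    # The number of levels is maxq + 1 because code 0 is included.
    print(f"target_rate={rate}: maxq={maxq}, levels={maxq + 1}")
```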

**Core compression function** (exact implementation from `yp_compress.py`):
- `compress_gptq(W, H, target_rate, groupsize, blocksize, percdamp, actorder)` - GPTQ with Cholesky-based inverse Hessian

**Implementation details:**
- Symmetric quantization (`sym=True`)
- Per-channel quantization parameters (`perchannel=True`)
- Blockwise error feedback propagation
- Works in float64 for numerical stability

Key parameters:
- `--percdamp 0.1` - Hessian damping (recommended)
- `--groupsize N` - Column grouping for scales (-1 = per-channel)
- `--actorder` - Reorder columns by activation magnitude

### ZSIC

ZSIC uses entropy-coded LDLQ (Lattice Decoding Quantization), so the achieved rate is variable: it is the entropy of the quantized weights rather than a fixed bit width.

**Core compression functions**:
- `compress_w2q(W, Sig_X, target_rate, Sig_hX, Sig_X_hX)` - Unified LDLQ function: handles both standard mode (when Sig_hX=None) and Qronos mode (when Sig_hX and Sig_X_hX are provided)
- `find_optimal_rescalers2(...)` - Alternating optimization for diagonal T and Gamma (supports both standard and Qronos modes, requires float64)

**T/Gamma Rescaler Optimization**: After LDLQ encoding, diagonal row (T) and column (Gamma) rescalers are optimized to minimize:
```
J(T,Γ) = E[||WX - T·Ŵ·Γ·X||²]
```
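
The objective above only depends on the activations through their covariance, which is why the pipeline can work from `Sig_X` instead of raw samples. The sketch below evaluates J(T, Γ) both ways on toy data to show they agree. It only evaluates the loss; it does not reproduce the alternating optimization in `find_optimal_rescalers2`, and the helper name `rescaler_loss` is an assumption made for illustration.

```python
import numpy as np


def rescaler_loss(W, W_hat, t, gamma, Sigma):
    """J(T, Γ) = tr(D Σ Dᵀ) with D = W - diag(t) · Ŵ · diag(gamma)."""
    D = W - (t[:, None] * W_hat) * gamma[None, :]
    return float(np.trace(D @ Sigma @ D.T))


rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))
W_hat = np.round(W)                    # crude stand-in for a quantized weight
X = rng.standard_normal((6, 256))      # toy activations, one sample per column
Sigma = X @ X.T / X.shape[1]           # empirical E[X Xᵀ]

t = np.ones(4)                         # row rescalers (diagonal of T)
gamma = np.ones(6)                     # column rescalers (diagonal of Γ)

loss_cov = rescaler_loss(W, W_hat, t, gamma, Sigma)

# Same quantity computed directly from the samples.
residual = W @ X - ((t[:, None] * W_hat) * gamma[None, :]) @ X
loss_samples = float(np.mean(np.sum(residual ** 2, axis=0)))

print(loss_cov, loss_samples)
```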

Key parameters:
- `--zsic_binary_search` - Search for the `target_rate` parameter whose achieved rate matches the desired rate
- `--rate_control` - Maintain global bit budget across layers

### Binary Search (`--zsic_binary_search`)

ZSIC's achieved rate (the entropy of the quantized weights) can differ from the `target_rate` parameter.
Binary search finds the `target_rate` value that yields the desired achieved rate.

**How it works:**
1. **Fast rate estimation**: For each candidate `target_rate`, run LDLQ on a subset of rows (default: 10%) without T/Gamma optimization. This gives a quick entropy estimate.
2. **Binary search**: Iterate to find the `target_rate` that produces entropy closest to the desired rate.
3. **Full compression**: Run `compress_w2q` with the best `target_rate`, including full T/Gamma optimization.

Options:
- `--zsic_binary_search_iters 10` - number of binary search iterations
- `--zsic_binary_search_row_fraction 0.1` - fraction of rows for fast rate estimation (default: 10%)
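
The search loop described above can be sketched as follows. This is a minimal illustration, not the code in `zsic.py`: `estimate_rate` stands in for the fast row-subset LDLQ pass, the search bounds are assumptions borrowed from the `--rate_xmin` / `--rate_xmax` defaults, and the sketch assumes the achieved rate grows monotonically with the parameter.

```python
from typing import Callable


def search_target_rate(
    estimate_rate: Callable[[float], float],
    desired_rate: float,
    lo: float = 0.05,
    hi: float = 16.0,
    iters: int = 10,
) -> float:
    """Bisect on the target_rate parameter; return the candidate whose
    estimated rate was closest to desired_rate."""
    best, best_err = lo, float("inf")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        rate = estimate_rate(mid)      # cheap estimate, no T/Gamma optimization
        err = abs(rate - desired_rate)
        if err < best_err:
            best, best_err = mid, err
        if rate < desired_rate:
            lo = mid                   # need a larger parameter
        else:
            hi = mid                   # need a smaller parameter
    return best


def toy_estimate(param: float) -> float:
    """Toy model: achieved entropy sits 0.3 bits below the parameter."""
    return max(param - 0.3, 0.0)


found = search_target_rate(toy_estimate, desired_rate=3.0, iters=20)
print(found, toy_estimate(found))
```

After the search, the full `compress_w2q` call would run once with the returned value.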

### Rate Control (`--rate_control`)

Maintains a global bit budget across all layers:
- Tracks consumed bits vs remaining budget
- Adjusts per-layer targets to hit global average
- Supports per-weight-type multipliers: `--rate_weight_budgets "wk:1.5,wq:1.25"`
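
One plausible way to implement this bookkeeping is shown below. It is a sketch under stated assumptions, not the logic in `rate_control.py`: the class `BitBudget`, its methods, and the choice to apply the multiplier to the remaining-average rate are all hypothetical. The clamp bounds reuse the `--rate_xmin` / `--rate_xmax` defaults.

```python
def parse_budgets(spec: str) -> dict[str, float]:
    """Parse "wk:1.5,wq:1.25" into {"wk": 1.5, "wq": 1.25}."""
    budgets = {}
    for item in filter(None, (s.strip() for s in spec.split(","))):
        name, value = item.split(":")
        budgets[name.strip()] = float(value)
    return budgets


class BitBudget:
    """Tracks remaining bits and parameters against a global average rate."""

    def __init__(self, total_params, global_rate, xmin=0.05, xmax=16.0):
        self.remaining_params = total_params
        self.remaining_bits = total_params * global_rate
        self.xmin, self.xmax = xmin, xmax

    def next_target(self, multiplier=1.0):
        if self.remaining_params <= 0:
            raise ValueError("no parameters left to quantize")
        # Average rate the remaining weights must hit to land on the budget.
        base = self.remaining_bits / self.remaining_params
        return min(max(base * multiplier, self.xmin), self.xmax)

    def consume(self, n_params, achieved_rate):
        self.remaining_params -= n_params
        self.remaining_bits -= n_params * achieved_rate


multipliers = parse_budgets("wk:1.5,wq:1.25")
budget = BitBudget(total_params=200, global_rate=3.0)
first_target = budget.next_target()
budget.consume(n_params=100, achieved_rate=3.5)   # first layer overshot
second_target = budget.next_target()              # later layers compensate
print(first_target, second_target, multipliers)
```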

### Qronos Mode (`--qronos`)

Standard quantization minimizes E[(WX - ŴX)²], which assumes that quantizing earlier layers leaves the input activations X unchanged.
Qronos instead minimizes E[(WX - ŴX̂)²], where X̂ are the actual activations from the
partially-quantized model. This requires computing three statistics:

- Σ_X = E[X X^T] - unquantized activations covariance
- Σ_X̂ = E[X̂ X̂^T] - quantized activations covariance
- Σ_{X,X̂} = E[X X̂^T] - cross-covariance

The quantization target becomes: `ŷ = W @ Σ_{X,X̂} @ (L^T)^{-1}` where `Σ_X̂ = L L^T`.
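
To see why this target is the right one, complete the square: E‖WX - ŴX̂‖² equals ‖ŴL - ŷ‖² plus a term that does not depend on Ŵ. The NumPy sketch below checks this identity on toy data. The variable names follow the `compress_w2q` arguments (`Sig_X`, `Sig_hX`, `Sig_X_hX`); the data and the rounding used for Ŵ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, n_samples = 5, 3, 400
X = rng.standard_normal((n_in, n_samples))                 # clean activations
X_hat = X + 0.1 * rng.standard_normal((n_in, n_samples))   # perturbed activations
W = rng.standard_normal((n_out, n_in))
W_hat = np.round(W * 2) / 2                                # toy quantized weight

Sig_X = X @ X.T / n_samples
Sig_hX = X_hat @ X_hat.T / n_samples
Sig_X_hX = X @ X_hat.T / n_samples

L = np.linalg.cholesky(Sig_hX)                             # Sig_hX = L Lᵀ
# y_hat = W Σ_{X,X̂} (Lᵀ)^{-1}, obtained by solving y_hat Lᵀ = W Σ_{X,X̂}.
y_hat = np.linalg.solve(L, (W @ Sig_X_hX).T).T

# Left side: the Qronos objective measured on the samples.
direct = float(np.sum((W @ X - W_hat @ X_hat) ** 2) / n_samples)

# Right side: distance to the target plus a constant independent of W_hat.
constant = float(np.trace(W @ Sig_X @ W.T) - np.sum(y_hat ** 2))
completed = float(np.sum((W_hat @ L - y_hat) ** 2)) + constant

print(direct, completed)
```

Because the constant does not involve Ŵ, minimizing the objective is the same as bringing ŴL as close as possible to ŷ.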

### Skip Quantization (`--skip_quantize`)

Keep specific weights in full precision, given as comma-separated `{layer}.{weight}` entries:
```bash
--skip_quantize "0.wq,0.wk,1.wq,1.wk"
```
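
A spec of this shape could be parsed as sketched below. The function name `parse_skip_spec` is hypothetical, and the `{layer}.{weight}` reading of each entry is inferred from the example above rather than from the pipeline source.

```python
def parse_skip_spec(spec: str) -> set[tuple[int, str]]:
    """Turn "0.wq,0.wk" into {(0, "wq"), (0, "wk")}."""
    skipped = set()
    for item in filter(None, (s.strip() for s in spec.split(","))):
        layer, weight = item.split(".", 1)
        skipped.add((int(layer), weight))
    return skipped


skip = parse_skip_spec("0.wq,0.wk,1.wq,1.wk")
# A weight is quantized only if its (layer, weight) pair is not listed.
quantize_wv = (0, "wv") not in skip
quantize_wq = (0, "wq") not in skip
print(sorted(skip), quantize_wv, quantize_wq)
```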

### Hadamard Rotation (`--hadamard`)

Apply a Hadamard transform to weights and/or activations to improve quantization. Three types are available:

- **Row Hadamard** (`--hadamard_type row`): W → W @ H, Σ → H^T @ Σ @ H
  - Transforms columns of W, changes input covariance structure
  - Good for spreading outliers across weight columns

- **Column Hadamard** (`--hadamard_type column`): W → H @ W
  - Transforms rows of W, covariance unchanged
  - Good for spreading outliers across weight rows

- **Row + Column** (`--hadamard_type row_column`): W → H @ W @ H
  - Applies both transforms for maximum spreading

Example:
```bash
--hadamard --hadamard_type row_column --hadamard_seed 42
```
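
The transforms listed above leave the layer output unchanged as long as the inverse rotation is applied on the other side. The NumPy sketch below illustrates this with a dense Sylvester-construction Hadamard matrix. It is an illustration only: `hadamard.py` uses the CUDA `fast-hadamard-transform` kernels, and any seed-dependent randomization behind `--hadamard_seed` is not modeled here.

```python
import numpy as np


def sylvester_hadamard(n: int) -> np.ndarray:
    """Orthonormal n x n Hadamard matrix; n must be a power of two."""
    if n <= 0 or (n & (n - 1)) != 0:
        raise ValueError("n must be a power of two")
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)


rng = np.random.default_rng(42)
n_out, n_in = 4, 8
W = rng.standard_normal((n_out, n_in))
W[:, 0] *= 50.0                          # plant an outlier column
X = rng.standard_normal((n_in, 64))
Sigma = X @ X.T / X.shape[1]

H_in = sylvester_hadamard(n_in)
H_out = sylvester_hadamard(n_out)

W_row = W @ H_in                         # "row" type: W -> W @ H
Sigma_row = H_in.T @ Sigma @ H_in        # covariance of the rotated inputs
W_col = H_out @ W                        # "column" type: W -> H @ W

# Compare the largest entry before and after spreading the outlier column.
print(np.abs(W).max(), np.abs(W_row).max())
```

For the row type, the rotated layer sees inputs `H_in.T @ X`, so `W_row @ (H_in.T @ X)` reproduces `W @ X`. For the column type, the original output is recovered as `H_out.T @ (W_col @ X)`.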

---

## Repo Layout

```
w-quant-new/
  quant_layerwise/
    pipeline.py           # Main quantization loop
    qronos_stats.py       # Qronos statistics (Σ_X, Σ_X̂, Σ_{X,X̂})
    hessian_runtime.py    # Hessian/covariance computation with caching
    rate_control.py       # Global rate budget tracking
    partial_model.py      # Load/apply quantized weights
    hadamard.py           # Hadamard transforms (row, column, row_column)
    methods/
      gptq.py             # GPTQ quantization
      zsic.py             # ZSIC/SIC quantization with Qronos support
    storage/
      artifacts.py        # LayerArtifact + RunManifest
  scripts/
    run_pipeline_job.py   # Main entry point
    run_eval_job.py       # Evaluation (PPL, KL)
    run_quant_sweep.py    # Quantization sweep over multiple rates
    run_eval_sweep.py     # Evaluation sweep with comparison plots
```

---

## Output Structure

```
$QUANT_BUCKET/quant_runs/{model}/{run_id}/
  manifest.json           # Tracks quantized layers
  layer_logs.jsonl        # Per-layer metrics (loss, entropy, rate)
  rate_control_state.json # Budget tracking state
  rate_summary.json       # Final rate statistics
  layers/
    layers.0.attention.wq.zsic.pt
    layers.0.attention.wk.zsic.pt
    ...
  qronos_stats/           # (if --qronos) Saved covariance matrices
    layers.0.attention.wq.pkl
    ...
```
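
Since `layer_logs.jsonl` holds one JSON record per line, per-layer metrics can be aggregated with the standard library. The sketch below is generic on purpose: the record keys (`layer`, `weight`, `rate`) are assumptions for illustration, so check an actual log line for the real field names.

```python
import io
import json


def mean_metric(lines, key):
    """Average a numeric field over JSONL records that contain it."""
    values = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if key in record:
            values.append(float(record[key]))
    return sum(values) / len(values) if values else None


# In practice: open(f"{run_dir}/layer_logs.jsonl"); here an in-memory sample.
sample = io.StringIO(
    '{"layer": 0, "weight": "wq", "rate": 3.0}\n'
    '{"layer": 0, "weight": "wk", "rate": 2.5}\n'
)
avg_rate = mean_metric(sample, "rate")
print(avg_rate)
```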

---

## CLI Reference

```
python -m scripts.run_pipeline_job --help

Required:
  --model MODEL           Model name (e.g., "3-8B")
  --method {gptq,zsic}    Quantization method
  --target_rate RATE      Target bits per parameter

Layer Selection:
  --layer_begin N         First layer (default: 0)
  --layer_end N           Last layer (exclusive, default: 32)
  --weights W1,W2,...     Weight types (default: wq,wk,wv,wo,w1,w2,w3)

Calibration:
  --seqlen N              Sequence length (default: 2048)
  --calib_nsamples N      Number of calibration samples (default: all)
  --hessian_batch_size N  Batch size for Hessian computation (default: 1)

GPTQ Options:
  --groupsize N           Group size (default: -1 = per-channel)
  --blocksize N           Block size (default: 128)
  --percdamp F            Hessian damping (recommended: 0.1)
  --actorder              Activation order heuristic

ZSIC Options:
  --zsic_binary_search    Enable binary search for target rate
  --zsic_binary_search_iters N
  --zsic_binary_search_row_fraction F
  --zsic_percdamp F       Hessian damping (default: 0.0001)
  --qronos                Enable Qronos mode

Rate Control:
  --rate_control          Enable global rate budget
  --global_rate_bits F    Global target (default: --target_rate)
  --rate_weight_budgets   Per-weight multipliers (e.g., "wk:1.5,wq:1.25")
  --rate_xmin F           Minimum allowed rate (default: 0.05)
  --rate_xmax F           Maximum allowed rate (default: 16.0)

Hadamard:
  --hadamard              Enable Hadamard rotation
  --hadamard_type TYPE    Type: row (W@H), column (H@W), row_column (both)
  --hadamard_seed N       Random seed (default: 0)

Output:
  --run_root PATH         Output directory (default: quant_runs)
  --run_id ID             Run identifier
  --resume / --no_resume  Resume from existing artifacts
```

---

## Evaluation

```bash
python -m scripts.run_eval_job \
  --run_dir $QUANT_BUCKET/quant_runs/3-8B/run_id \
  --eval_nsamples 128 \
  --seqlen 2048
```

Computes: `ppl_quant`, `ppl_ref`, `kl_ref_to_quant`

Use `--ppl_only` to skip the KL divergence when the reference and quantized models do not both fit in memory.
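
For reference, the reported metrics follow the usual definitions: perplexity is the exponential of the mean per-token negative log-likelihood, and the KL term compares next-token distributions. The sketch below shows both on toy inputs. Reading `kl_ref_to_quant` as KL(ref ‖ quant) is an assumption based on the metric name, and the helper names are hypothetical.

```python
import math


def perplexity(token_nlls):
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def kl_divergence(p_ref, p_quant):
    """KL(p_ref || p_quant) for two discrete distributions."""
    return sum(p * math.log(p / q) for p, q in zip(p_ref, p_quant) if p > 0)


# A model that assigns probability 1/4 to every token has perplexity 4.
ppl = perplexity([math.log(4.0)] * 3)

# KL is zero for identical distributions and positive otherwise.
kl_same = kl_divergence([0.5, 0.5], [0.5, 0.5])
kl_diff = kl_divergence([0.5, 0.5], [0.25, 0.75])
print(ppl, kl_same, kl_diff)
```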

---

## Environment Setup

Set the `QUANT_BUCKET` environment variable to specify where quantization runs are stored:

```bash
export QUANT_BUCKET=/path/to/quant-bucket
```

All sweep outputs, manifests, and plots will be saved under this path.

---

## Sweep Scripts

The sweep scripts automate running quantization and evaluation over multiple target rates, with automatic GPU scheduling and comparison plots.

### Quantization Sweep

Run quantization over a range of target rates. Creates a sweep manifest file that links all runs.

```bash
# ZSIC sweep: rates 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5
python -m scripts.run_quant_sweep \
  --model 3-8B \
  --method zsic \
  --rate_min 0.5 \
  --rate_max 3.5 \
  --rate_step 0.5

# GPTQ sweep with explicit rates and groupsize
python -m scripts.run_quant_sweep \
  --model 3-8B \
  --method gptq \
  --rates "1,2,3,4" \
  --groupsize 128

# ZSIC sweep with Qronos mode
python -m scripts.run_quant_sweep \
  --model 3.2-1B \
  --method zsic \
  --rate_min 1.0 \
  --rate_max 4.0 \
  --qronos

# Specify GPUs explicitly
python -m scripts.run_quant_sweep \
  --model 3.2-1B \
  --method zsic \
  --rate_min 1.0 \
  --rate_max 4.0 \
  --gpus "0,1,2,3"
```

Supported models: `3.2-1B` (16 layers), `3-8B` (32 layers)

**Options:**
- `--model` - Model name (required)
- `--method {zsic,gptq}` - Quantization method (required)
- `--rate_min`, `--rate_max`, `--rate_step` - Rate range (default: 0.5 to 3.5 step 0.5)
- `--rates "1.0,2.0,3.0"` - Explicit comma-separated rates (overrides min/max/step)
- `--groupsize N` - GPTQ group size (default: -1 = per-channel)
- `--qronos` - Enable Qronos mode for ZSIC
- `--hessian_batch_size N` - Batch size for Hessian computation (default: 32)
- `--gpus "0,1,2"` - Comma-separated GPU IDs (default: auto-detect free GPUs)
- `--run_root PATH` - Output directory (default: quant_runs)

Output: `$QUANT_BUCKET/quant_runs/{model}/sweeps/sweep_{method}_{timestamp}.json`
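
The rate options can be read as follows: `--rate_min`, `--rate_max`, and `--rate_step` expand to an inclusive grid, and `--rates` replaces that grid when given. The sketch below reproduces the grid from the first example; the function names are hypothetical and the rounding strategy is an assumption, not the code in `run_quant_sweep.py`.

```python
import math


def rate_grid(rate_min: float, rate_max: float, rate_step: float) -> list[float]:
    """Inclusive grid from rate_min to rate_max in steps of rate_step."""
    # The small epsilon guards against floating-point error at the endpoint.
    n_steps = math.floor((rate_max - rate_min) / rate_step + 1e-9)
    return [round(rate_min + i * rate_step, 6) for i in range(n_steps + 1)]


def parse_rates(spec: str) -> list[float]:
    """Parse an explicit list such as "1,2,3,4"."""
    return [float(x) for x in spec.split(",") if x.strip()]


default_grid = rate_grid(0.5, 3.5, 0.5)
explicit = parse_rates("1,2,3,4")
print(default_grid, explicit)
```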

### Evaluation Sweep

Run evaluations on sweep runs and generate comparison plots. The script reads sweep manifest files to determine which runs to evaluate.

```bash
# Evaluate a specific sweep (manifest path printed by run_quant_sweep)
python -m scripts.run_eval_sweep \
  --sweep $QUANT_BUCKET/quant_runs/3-8B/sweeps/sweep_zsic_20260120_123456.json \
  --eval --plot

# Compare multiple sweeps (ZSIC vs GPTQ)
python -m scripts.run_eval_sweep \
  --sweep $QUANT_BUCKET/quant_runs/3-8B/sweeps/sweep_zsic_20260120.json \
  --sweep $QUANT_BUCKET/quant_runs/3-8B/sweeps/sweep_gptq_20260120.json \
  --eval --plot

# Auto-discover all sweeps for a model
python -m scripts.run_eval_sweep \
  --model 3-8B \
  --run_root quant_runs \
  --eval --plot

# Just plot (skip eval if eval.json files already exist)
python -m scripts.run_eval_sweep \
  --sweep $QUANT_BUCKET/quant_runs/3-8B/sweeps/sweep_zsic_20260120.json \
  --plot

# PPL-only evaluation (saves memory - no KL divergence)
python -m scripts.run_eval_sweep \
  --sweep $QUANT_BUCKET/quant_runs/3-8B/sweeps/sweep_zsic_20260120.json \
  --eval --ppl_only

# Force re-evaluation even if eval.json exists
python -m scripts.run_eval_sweep \
  --sweep $QUANT_BUCKET/quant_runs/3-8B/sweeps/sweep_zsic_20260120.json \
  --eval --force_reeval
```

**Options:**
- `--sweep PATH` - Path to sweep manifest (can specify multiple for comparison)
- `--model MODEL` - Auto-discover all sweeps for this model
- `--eval` - Run evaluations
- `--plot` - Generate comparison plots
- `--ppl_only` - Only compute perplexity (skip KL divergence)
- `--force_reeval` - Re-run evaluations even if eval.json exists
- `--seqlen N` - Sequence length for evaluation (default: 2048)
- `--eval_nsamples N` - Number of eval samples (default: all)
- `--gpus "0,1,2"` - Comma-separated GPU IDs (default: auto-detect)
- `--output_dir PATH` - Output directory for plots (default: $QUANT_BUCKET/quant_runs/plots)

**Output plots:**
- `$QUANT_BUCKET/quant_runs/plots/{model}_ppl_vs_rate.png` - Perplexity comparison
- `$QUANT_BUCKET/quant_runs/plots/{model}_kl_vs_rate.png` - KL divergence comparison
- `$QUANT_BUCKET/quant_runs/plots/{model}_sweep_results.json` - Raw results data

### Complete Workflow Example

```bash
# 1. Set up environment
export QUANT_BUCKET=/home/user/quant-bucket

# 2. Run ZSIC quantization sweep
python -m scripts.run_quant_sweep \
  --model 3-8B \
  --method zsic \
  --rate_min 0.5 \
  --rate_max 4.0 \
  --rate_step 0.5

# 3. Run GPTQ quantization sweep for comparison
python -m scripts.run_quant_sweep \
  --model 3-8B \
  --method gptq \
  --rate_min 1.0 \
  --rate_max 4.0 \
  --rate_step 0.5 \
  --groupsize -1

# 4. Evaluate all sweeps and generate comparison plots
python -m scripts.run_eval_sweep \
  --model 3-8B \
  --eval --plot
```

### Sweep Manifest Format

The sweep manifest JSON links quantization runs for evaluation:

```json
{
  "sweep_id": "sweep_zsic_20260120_123456",
  "model": "3-8B",
  "method": "zsic",
  "num_layers": 32,
  "rates": [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5],
  "runs": [
    {
      "rate": 0.5,
      "run_id": "3-8B.zsic.r0.50",
      "run_dir": "/path/to/quant-bucket/quant_runs/3-8B/3-8B.zsic.r0.50"
    },
    ...
  ],
  "created_at": "2026-01-20T12:34:56",
  "qronos": false
}
```
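
A manifest in this format can be consumed with the standard `json` module. The sketch below uses a trimmed manifest containing only fields shown above, maps each rate to its run directory, and reports rates that have no run entry. The helper `missing_rates` is hypothetical, and the trimmed manifest deliberately lists a rate without a run to exercise it.

```python
import json

manifest_text = """
{
  "sweep_id": "sweep_zsic_20260120_123456",
  "model": "3-8B",
  "method": "zsic",
  "rates": [0.5, 1.0],
  "runs": [
    {
      "rate": 0.5,
      "run_id": "3-8B.zsic.r0.50",
      "run_dir": "/path/to/quant-bucket/quant_runs/3-8B/3-8B.zsic.r0.50"
    }
  ]
}
"""


def missing_rates(manifest: dict) -> list[float]:
    """Rates listed in the sweep that have no corresponding run entry."""
    done = {run["rate"] for run in manifest["runs"]}
    return [rate for rate in manifest["rates"] if rate not in done]


# In practice: json.load(open(path_to_manifest)).
manifest = json.loads(manifest_text)
run_dirs = {run["rate"]: run["run_dir"] for run in manifest["runs"]}
print(run_dirs[0.5])
print(missing_rates(manifest))
```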
