# QKMEANS: Fast k-means++ Seeding via Quantization

A C++ benchmarking framework for k-means++ seeding algorithms, including QKMEANS, a near-linear-time seeding method that combines HNSW approximate nearest-neighbor search with rejection sampling.

## Requirements

**C++ (core benchmarks):**
- g++ with C++17 support
- OpenMP

**Python (experiments, plotting, analysis):**
```bash
pip install -r requirements.txt
```

Dependencies: `faiss-cpu`, `numpy`, `pandas`, `matplotlib`, `scikit-learn`, `tqdm`

## Building

```bash
make          # Build all runners (bin/run_single, bin/run_comparison, bin/run_sweep)
make debug    # Build with debug symbols (-O0 -g)
make clean    # Remove bin/
```

The default build compiles with `-O3 -march=native -ffast-math -std=c++17 -fopenmp`.

## Datasets

Datasets are space-separated text files (one point per line) stored in `datasets/`. Supported datasets:

| Dataset | Source | Dimensions |
|---------|--------|------------|
| MNIST | Handwritten digits | 784 |
| FashionMNIST | Clothing images | 784 |
| CIFAR-10 | Natural images | 3072 |
| CIFAR-100 | Natural images (fine) | 3072 |
| MNIST-CLIP | CLIP embeddings | 512 |
| FashionMNIST-CLIP | CLIP embeddings | 512 |
| CIFAR-10-CLIP | CLIP embeddings | 512 |
| CIFAR-100-CLIP | CLIP embeddings | 512 |
| HAR | Human activity recognition | 561 |
| SUSY | Particle physics | 18 |
| Reddit | Text embeddings | 512 |
| StackExchange | Text embeddings | 512 |

**Preprocessing:**
```bash
python datasets/preprocess.py npy2txt data.npy          # Convert numpy to text
python datasets/preprocess.py labels mnist               # Generate label file
python datasets/preprocess.py clip mnist                 # Generate CLIP embeddings
python datasets/preprocess.py info mnist                 # Print dataset statistics
```
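
The text format described above (space-separated values, one point per line) can be read directly with NumPy. The sketch below is illustrative and not part of the repository; `load_points` and `save_points` are hypothetical helper names, and the `%.6g` output precision is an assumption rather than what `preprocess.py` uses.

```python
import io
import numpy as np

def load_points(source):
    """Load a space-separated text dataset: one point per line, one value per column."""
    # ndmin=2 keeps the (n_points, n_dims) shape even for a single-line file.
    return np.loadtxt(source, dtype=np.float64, ndmin=2)

def save_points(points, target):
    """Write points back in the same one-point-per-line format (precision is an assumption)."""
    np.savetxt(target, points, fmt="%.6g", delimiter=" ")

# Tiny in-memory example standing in for a file such as datasets/mnist.txt.
demo = io.StringIO("0.0 1.0 2.0\n3.0 4.0 5.0\n")
points = load_points(demo)
print(points.shape)  # (2, 3)
```

Both helpers accept either a file path or an open file-like object, so the same code works on the real `.txt` files.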


## Running Benchmarks

### C++ Runners

**Single algorithm:**
```bash
./bin/run_single <algorithm> config.json
# Algorithms: kmeanspp, afkmc2, prone, pronecoreset, fastcoreset, rejectionlsh, qkmeans
```
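
For orientation, the `kmeanspp` baseline refers to standard k-means++ (D² sampling) seeding. The following is a minimal textbook sketch in NumPy, not the repository's C++ implementation; the function name `kmeanspp_seed` and the synthetic data are illustrative assumptions.

```python
import numpy as np

def kmeanspp_seed(points, k, rng):
    """Textbook k-means++ (D^2) seeding; returns indices of the k chosen centers.

    Assumes the data contain at least k distinct points.
    """
    n = points.shape[0]
    centers = [int(rng.integers(n))]            # first center: uniform at random
    # Squared distance from every point to its nearest chosen center so far.
    d2 = np.sum((points - points[centers[0]]) ** 2, axis=1)
    for _ in range(1, k):
        probs = d2 / d2.sum()                   # D^2 sampling distribution
        idx = int(rng.choice(n, p=probs))
        centers.append(idx)
        new_d2 = np.sum((points - points[idx]) ** 2, axis=1)
        d2 = np.minimum(d2, new_d2)             # full pass over the data per center
    return np.array(centers)

# Synthetic demo data: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
offsets = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.concatenate([rng.normal(size=(50, 2)) + o for o in offsets])
seeds = kmeanspp_seed(points, k=3, rng=rng)
print(seeds)
```

Each new center requires a full pass over the data, so exact seeding costs O(nkd) time; the faster seeding methods listed above target this cost.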

**All algorithms on one dataset:**
```bash
./bin/run_comparison config.json
```

**Hyperparameter sweep:**
```bash
./bin/run_sweep config.json
```

### Python Runner

```bash
python scripts/run/run_benchmarks.py mnist --algorithms qkmeans,kmeanspp
python scripts/run/run_benchmarks.py --all    # Run all datasets
```

### Config Files

Generate configs from the template:
```bash
python configs/generate_configs.py                          # All configs
python configs/generate_configs.py --dataset mnist          # Single dataset
python configs/generate_configs.py --algorithm qkmeans      # Single algorithm
```

Pre-built configs are in `configs/benchmark/`. Config format:
```json
{
  "name": "mnist",
  "data_path": "datasets/mnist.txt",
  "labels_path": "datasets/mnist_labels.txt",
  "k_values": [10, 50, 100, 200, 500],
  "num_runs": 5,
  "m_values": [100],
  "ef_values": [50],
  "alpha_values": [0.01],
  "output_csv": "results/qkmeans_mnist.csv"
}
```
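
As a rough illustration of how such a config could be consumed, the sketch below parses the example above and enumerates parameter combinations. It assumes the `*_values` lists are combined as a full Cartesian product and repeated `num_runs` times; the runners may combine them differently, and `expand_grid` is a hypothetical helper.

```python
import itertools
import json

CONFIG_TEXT = """
{
  "name": "mnist",
  "data_path": "datasets/mnist.txt",
  "labels_path": "datasets/mnist_labels.txt",
  "k_values": [10, 50, 100, 200, 500],
  "num_runs": 5,
  "m_values": [100],
  "ef_values": [50],
  "alpha_values": [0.01],
  "output_csv": "results/qkmeans_mnist.csv"
}
"""

def expand_grid(config):
    """Yield one dict per (k, m, ef, alpha, run) combination (assumed semantics)."""
    grid = itertools.product(
        config["k_values"],
        config["m_values"],
        config["ef_values"],
        config["alpha_values"],
        range(config["num_runs"]),
    )
    for k, m, ef, alpha, run in grid:
        yield {"k": k, "m": m, "ef": ef, "alpha": alpha, "run": run}

config = json.loads(CONFIG_TEXT)
jobs = list(expand_grid(config))
print(len(jobs))  # 25
```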

## Experiments

### Scaling Laws

Validates the quantization-theoretic scaling laws &beta;<sub>k</sub> ~ k<sup>&epsilon;</sup> and &eta;<sub>k</sub> ~ k<sup>&epsilon;/2</sup>.

```bash
python experiments/scaling.py mnist --k-values 5 10 50 100 250 500 --plot
```
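
A power law of the form y ~ c·k<sup>&epsilon;</sup> is a straight line in log-log space, so the exponent can be estimated with an ordinary least-squares fit on the logarithms. The sketch below shows this on synthetic data; it is not the code in `experiments/scaling.py`, and `fit_power_law`, the constant 2.0, and the exponent 0.3 are all illustrative assumptions.

```python
import numpy as np

def fit_power_law(k_values, y_values):
    """Fit y ~ c * k**exponent by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(k_values), np.log(y_values), deg=1)
    return slope, np.exp(intercept)

# Same k grid as the command above; the measurements here are synthetic.
k_values = np.array([5, 10, 50, 100, 250, 500], dtype=float)
true_eps = 0.3                                  # hypothetical exponent for the demo
rng = np.random.default_rng(0)
noise = np.exp(rng.normal(0.0, 0.01, size=k_values.size))
beta = 2.0 * k_values ** true_eps * noise       # stand-in for measured beta_k values

eps_hat, c_hat = fit_power_law(k_values, beta)
print(f"estimated exponent: {eps_hat:.3f}, prefactor: {c_hat:.3f}")
```

Under the stated scaling laws, the same fit applied to &eta;<sub>k</sub> should give a slope of about half the one obtained for &beta;<sub>k</sub>.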

### Rejection Rate Analysis

Measures how the failure rate of rejection sampling decreases as the MCMC chain length m grows:

```bash
python experiments/rejection_rate.py --download mnist \
  --m-values 1 2 3 5 7 10 15 20 30 50 \
  --k-values 10 50 100 --n-runs 5 \
  --output-dir experiments/results/rejection_rate
```
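
As intuition for why the failure rate should fall with m, consider an idealized model in which each of the m proposals is accepted independently with a fixed probability p. This is a simplifying assumption for illustration only: real chain steps need not be independent, and the value of p below is hypothetical. Under this model the probability that all m proposals are rejected is (1 - p)<sup>m</sup>, which the sketch checks by simulation.

```python
import numpy as np

def failure_probability(p_accept, m):
    """P(no proposal accepted in m independent tries) = (1 - p)^m."""
    return (1.0 - p_accept) ** m

def simulate_failure_rate(p_accept, m, trials, rng):
    """Monte Carlo estimate of the same quantity under the independence assumption."""
    accepted = rng.random((trials, m)) < p_accept
    return 1.0 - accepted.any(axis=1).mean()

rng = np.random.default_rng(0)
p_accept = 0.2                                  # hypothetical acceptance probability
m_values = [1, 2, 3, 5, 7, 10, 15, 20, 30, 50]  # same grid as the command above
analytic = [failure_probability(p_accept, m) for m in m_values]
simulated = [simulate_failure_rate(p_accept, m, 20000, rng) for m in m_values]
for m, a, s in zip(m_values, analytic, simulated):
    print(f"m={m:2d}  analytic={a:.4f}  simulated={s:.4f}")
```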

### Noisy Scaling Laws

Measures scaling-law behavior under varying noise-to-signal ratios (NSR):

```bash
python experiments/noisy_scaling.py -d datasets/mnist.txt --name mnist \
  --nsr-min 0 --nsr-max 2 --nsr-steps 20 \
  --k-values 5 10 50 100 250 500 --n-runs 3 --n-iter 20 \
  --output-dir experiments/results/noisy_scaling
```
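
One common way to realize a given noise-to-signal ratio is to add isotropic Gaussian noise whose variance is the NSR times the mean per-coordinate variance of the data. The sketch below uses that definition as an assumption; `experiments/noisy_scaling.py` may define the NSR differently, and `add_noise` is a hypothetical helper.

```python
import numpy as np

def add_noise(points, nsr, rng):
    """Add Gaussian noise so that noise power / signal power equals nsr (assumed definition)."""
    centered = points - points.mean(axis=0)
    signal_power = np.mean(centered ** 2)       # mean per-coordinate variance
    noise_std = np.sqrt(nsr * signal_power)
    return points + rng.normal(0.0, noise_std, size=points.shape)

rng = np.random.default_rng(0)
points = rng.normal(size=(2000, 10))            # synthetic stand-in for a dataset
# Assumed reading of --nsr-min 0 --nsr-max 2 --nsr-steps 20: an evenly spaced grid.
nsr_grid = np.linspace(0.0, 2.0, 20)
noisy_versions = [add_noise(points, nsr, rng) for nsr in nsr_grid]

# Empirical check of the ratio for the last grid value.
signal_power = np.mean((points - points.mean(axis=0)) ** 2)
empirical_nsr = np.mean((noisy_versions[-1] - points) ** 2) / signal_power
print(f"target NSR: {nsr_grid[-1]:.2f}, empirical NSR: {empirical_nsr:.2f}")
```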

### Intrinsic Dimension Estimation

Computes the Levina-Bickel maximum-likelihood estimate (MLE) of intrinsic dimension:

```bash
python experiments/compute_intrinsic_dim.py --datasets mnist fmnist cifar10
```
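
For reference, the Levina-Bickel estimator at a point x with k-nearest-neighbor distances T<sub>1</sub> &le; ... &le; T<sub>k</sub> is the inverse of the mean of log(T<sub>k</sub>/T<sub>j</sub>) over j = 1, ..., k-1, averaged over all points. The sketch below is a brute-force NumPy version for small inputs, not the repository's script; `mle_intrinsic_dim`, the choice k=20, and the synthetic data are illustrative assumptions.

```python
import numpy as np

def mle_intrinsic_dim(points, k=20):
    """Levina-Bickel MLE of intrinsic dimension, averaged over all points.

    Assumes no duplicate points (a zero neighbor distance would break the log ratio).
    """
    # Brute-force pairwise squared distances; fine for a few thousand points.
    sq_norms = np.sum(points ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(d2, np.inf)                # exclude each point from its own neighbors
    # Distances T_1 <= ... <= T_k to the k nearest neighbors of every point.
    knn = np.sqrt(np.maximum(np.sort(d2, axis=1)[:, :k], 0.0))
    # Per-point estimate: inverse of the mean of log(T_k / T_j), j = 1..k-1.
    log_ratios = np.log(knn[:, -1:] / knn[:, :-1])
    per_point = (k - 1) / np.sum(log_ratios, axis=1)
    return float(np.mean(per_point))

# Synthetic check: 2-D uniform data embedded linearly in 5-D.
rng = np.random.default_rng(0)
latent = rng.random((1000, 2))
basis = np.linalg.qr(rng.normal(size=(5, 2)))[0]   # 5x2 matrix with orthonormal columns
embedded = latent @ basis.T
estimate = mle_intrinsic_dim(embedded, k=20)
print(f"estimated intrinsic dimension: {estimate:.2f}")
```

The estimate should land near 2 despite the ambient dimension of 5. For large datasets a nearest-neighbor index would replace the quadratic distance matrix.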

### Aspect Ratio

```bash
python experiments/compute_aspect_ratio.py mnist
```

## Plotting and Analysis

### Benchmark Plots

```bash
python scripts/plot/plot_benchmark.py results/benchmark/comparison_mnist.csv
```

### Grid Plots (all datasets)

```bash
python experiments/plot_all_benchmarks.py
python experiments/plot_quality_vs_runtime.py
python experiments/plot_seeding_cost_vs_runtime.py
```

### Scaling Law Plots

```bash
python experiments/plot_beta_scaling_grid.py
python experiments/plot_eta_scaling_grid.py
```

### Intrinsic Dimension Plots

```bash
python experiments/plot_eps_vs_mle_intrinsic_dim.py
python experiments/plot_d_eps_vs_d_mle.py
```

### LaTeX Tables

```bash
python scripts/analysis/generate_tables.py results/benchmark/*.csv --output results/tables/
python experiments/generate_latex_table.py
```


## Project Structure

```
qkmeans/
├── src/
│   ├── core/           # Dataset, Lloyd's, clustering metrics
│   ├── algorithms/     # Seeding: kmeanspp, afkmc2, prone, qkmeans, etc.
│   └── bin/            # C++ runners (run_single, run_comparison, run_sweep)
├── scripts/
│   ├── run/            # Python benchmark runner
│   ├── plot/           # Benchmark visualization
│   └── analysis/       # LaTeX tables and metrics
├── experiments/        # Experiment scripts and results
│   └── results/        # CSVs and plots from experiments
├── configs/
│   ├── benchmark/      # Per-algorithm per-dataset configs
│   ├── template.json   # Config template
│   └── generate_configs.py
├── datasets/           # Data files (.txt) and preprocess.py
├── external/           # hnswlib, nlohmann/json
└── results/            # Benchmark output CSVs, plots, and tables
```
