# Correlation Clustering Implementation

This package contains implementations of the Pivot, PrunedPivot, C^4Approx, and SimpleSampling correlation clustering algorithms, along with various optimization techniques and experimental frameworks.
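
For orientation, the classical randomized Pivot procedure can be sketched as below. This is a minimal illustration of the textbook algorithm, not the code in this package: the names `Graph` and `pivot_clustering`, the adjacency-list input, and the `seed` parameter are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical input type: adjacency lists of the "similar" (positive) edges,
// with vertices numbered 0..n-1.
using Graph = std::vector<std::vector<int>>;

// Textbook Pivot sketch: visit vertices in a uniformly random order; every
// vertex that is still unclustered becomes a pivot and absorbs its
// still-unclustered neighbors. Returns labels[v] = id of v's pivot.
std::vector<int> pivot_clustering(const Graph& g, std::uint64_t seed) {
    const int n = static_cast<int>(g.size());
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::mt19937_64 rng(seed);
    std::shuffle(order.begin(), order.end(), rng);

    std::vector<int> labels(n, -1);  // -1 marks "not yet clustered"
    for (int pivot : order) {
        if (labels[pivot] != -1) continue;  // already absorbed by an earlier pivot
        labels[pivot] = pivot;
        for (int u : g[pivot]) {
            assert(u >= 0 && u < n);
            if (labels[u] == -1) labels[u] = pivot;
        }
    }
    return labels;
}
```

On a graph made of disjoint cliques, this sketch recovers each clique as one cluster regardless of the random order, which makes it a convenient sanity check.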

## Core Algorithm Files

**graph_cc_cost_working.hpp**
- Main header file containing the core algorithm implementation
- Includes functions for cost computation, clustering with different space budgets, and experimental harnesses
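
As an illustration of what cost computation means here, the sketch below counts disagreements under the standard unweighted formulation: each listed edge is a "+" pair, every other pair is a "-" pair, and the cost is the number of "+" pairs split across clusters plus the number of "-" pairs placed in the same cluster. The function name `disagreement_cost` and its signature are hypothetical; the header's actual interface may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper, not the package's API. Assumes `edges` lists each
// undirected "+" edge once, with no self-loops or duplicates, and that
// labels[v] is the cluster id of vertex v.
std::uint64_t disagreement_cost(int n,
                                const std::vector<std::pair<int, int>>& edges,
                                const std::vector<int>& labels) {
    assert(static_cast<int>(labels.size()) == n);

    // Count the vertices in each cluster.
    std::unordered_map<int, std::uint64_t> cluster_size;
    for (int v = 0; v < n; ++v) ++cluster_size[labels[v]];

    // Total number of vertex pairs that share a cluster.
    std::uint64_t intra_pairs = 0;
    for (const auto& kv : cluster_size) {
        const std::uint64_t s = kv.second;
        intra_pairs += s * (s - 1) / 2;
    }

    // Split the "+" edges into those inside a cluster and those cut.
    std::uint64_t intra_edges = 0, cut_edges = 0;
    for (const auto& e : edges) {
        if (labels[e.first] == labels[e.second]) ++intra_edges;
        else ++cut_edges;
    }

    // Cut "+" edges, plus same-cluster pairs that are not "+" edges.
    return cut_edges + (intra_pairs - intra_edges);
}
```

Counting pairs per cluster instead of enumerating all pairs keeps the work linear in the number of vertices and edges, which matters once the graph is large.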

**util.cpp**
- Utility functions for general-purpose operations
- Included by the main algorithm header

**serialization.cpp**
- Handles reading and writing graph data in binary (.bin) and NumPy (.npz) formats
- Provides deserialization functions for input graphs

**MurmurHash3.cpp, MurmurHash3.h**
- Third-party MurmurHash3 implementation for hashing operations
- Used internally by the algorithm

## Experimental Programs

**graph-cc-k-vs-error.cpp**
- Experimental program measuring clustering quality (error) as a function of the number of passes (k)
- Outputs results in JSON format to the `results/` directory

**graph-cc-space-vs-error.cpp**
- Experimental program measuring clustering quality (error) as a function of memory/space budget
- Tests multiple space budget levels to evaluate space-accuracy tradeoffs

**simple-sampler-space-vs-error.cpp**
- Experimental program for the simple sampling approach
- Measures clustering quality (error) as a function of space budget for the sampling-based variant

**graph_cc_cost_working_run.cpp**
- Main batch runner for experiments
- Aggregates results from multiple experimental configurations

## Utility Programs

**build-embedding-graph.cpp**
- Preprocesses embedding data to construct similarity graphs
- Converts embedding vectors into edge lists for clustering
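
One common way to turn embeddings into an edge list is to connect every pair whose similarity meets a threshold. The sketch below assumes cosine similarity and a brute-force pass over all pairs; both choices, along with the names `Embedding`, `cosine_similarity`, and `build_similarity_edges`, are illustrative assumptions and may not match what `build-embedding-graph.cpp` does.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Embedding = std::vector<double>;  // hypothetical representation

// Cosine similarity of two equal-length vectors; returns 0 if either is zero.
double cosine_similarity(const Embedding& a, const Embedding& b) {
    assert(a.size() == b.size());
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
    }
    if (norm_a == 0.0 || norm_b == 0.0) return 0.0;
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}

// Emits an undirected edge (i, j), i < j, for every pair whose similarity is
// at least `threshold`. Quadratic in the number of points.
std::vector<std::pair<int, int>> build_similarity_edges(
        const std::vector<Embedding>& points, double threshold) {
    std::vector<std::pair<int, int>> edges;
    const int n = static_cast<int>(points.size());
    for (int i = 0; i < n; ++i) {
        for (int j = i + 1; j < n; ++j) {
            if (cosine_similarity(points[i], points[j]) >= threshold) {
                edges.emplace_back(i, j);
            }
        }
    }
    return edges;
}
```

The all-pairs loop is only practical for small inputs; at larger scale it would typically be replaced by blocked matrix products or an approximate nearest-neighbor index.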

**prepare-data.cpp**
- Data preparation utilities for formatting input graphs

**check-graph.cpp**
- Validation tool for verifying graph data integrity
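
The kinds of integrity checks such a tool might perform can be sketched as follows. The specific checks (endpoint range, self-loops, duplicate edges) and the function `validate_edge_list` are assumptions chosen for illustration; the actual checks in `check-graph.cpp` are not documented here.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical validator: returns an empty string when the edge list is a
// simple undirected graph on vertices 0..n-1, otherwise a short description
// of the first problem found.
std::string validate_edge_list(int n,
                               const std::vector<std::pair<int, int>>& edges) {
    assert(n >= 0);
    std::set<std::pair<int, int>> seen;
    for (const auto& e : edges) {
        const int u = e.first, v = e.second;
        if (u < 0 || u >= n || v < 0 || v >= n) return "endpoint out of range";
        if (u == v) return "self-loop";
        // Normalize so that (u, v) and (v, u) count as the same edge.
        const std::pair<int, int> key(std::min(u, v), std::max(u, v));
        if (!seen.insert(key).second) return "duplicate edge";
    }
    return "";
}
```

Running a check like this before clustering helps keep malformed input from silently skewing the reported cost.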

## Build and Execution

**Makefile**
- Standard build configuration for all C++ programs
- Requires: Eigen (linear algebra), CNPY (NumPy I/O)
- Use `make` to build all targets

**run-experiment.sh**
- Wrapper script for running a single experiment with a graph file
- Usage: `./run-experiment.sh <graph_file> [threshold]`

**run-all-experiments.sh**
- Batch runner executing experiments on multiple datasets
- Modify to customize which datasets are processed

## Data Processing Scripts

**save_embeddings_imagenet.py**
- Extracts embeddings from ImageNet21k using a pre-trained model
- Saves embeddings in PyTorch format for batch processing

**save_embeddings_imagenet_ddp.py**
- Distributed version of embedding extraction using DDP (Distributed Data Parallel)
- For large-scale embedding extraction across multiple GPUs

**torch-to-numpy.py**
- Converts PyTorch embedding tensors to NumPy format
- Intermediate conversion utility for processing

## Output

Experimental results are written to the `results/` directory as JSON files, with timestamped filenames and a `latest.txt` symlink for quick access.