# Learning-Augmented Streaming Algorithms for Correlation Clustering

This repository contains the code for the NeurIPS'25 paper *"Learning-Augmented Streaming Algorithms for Correlation Clustering"* by Yinhao Dong, Shan Jiang, Shi Li, and Pan Peng.


## Getting Started

Download the required datasets from the [Stanford Large Network Dataset Collection](https://snap.stanford.edu/data/) and extract each one into its corresponding subdirectory under the `data` folder.
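
The extraction step can be sketched in Python as follows. This is a minimal helper under stated assumptions, not part of the repository: the function name `extract_archive` is hypothetical, and the destination subdirectory must be whatever name the preprocessing scripts expect under `data`. It handles plain `.gz` files with `gzip` and falls back to `shutil.unpack_archive` for `.zip` and tar archives.

```python
import gzip
import shutil
from pathlib import Path


def extract_archive(archive, dest_dir):
    """Extract a downloaded archive into dest_dir and return dest_dir.

    Hypothetical helper: plain .gz files are decompressed to a single file,
    anything else (.zip, .tar, .tar.gz, ...) is handed to shutil.unpack_archive.
    """
    archive = Path(archive)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    name = archive.name
    if name.endswith(".gz") and not name.endswith(".tar.gz"):
        # Single compressed file: strip the ".gz" suffix for the output name.
        out_path = dest_dir / name[: -len(".gz")]
        with gzip.open(archive, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
    else:
        # Multi-file archive: the format is inferred from the file extension.
        shutil.unpack_archive(str(archive), str(dest_dir))
    return dest_dir
```

A call would look like `extract_archive("downloads/<archive>", "data/<subdirectory>")`, where both paths are placeholders to be replaced with the downloaded file and the matching subdirectory.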

## Preprocessing
### For SBM, Facebook, EmailCore, and LastFM Datasets

Navigate to the `preprocessing_general` folder and run the following commands:
```bash
python preprocess.py
python predict.py
```
These scripts will preprocess the datasets and generate the prediction files used in our algorithms.

### For SBM Datasets with $n=1200, 2400, 3600$ and the DBLP Dataset
Navigate to the `preprocessing_binary` folder and run the following commands:
```bash
# Step 1: For the DBLP dataset only
python sample_community_relations.py

# Step 2: For all applicable datasets
python process.py
python gen_edges.py
python predict.py
```
These scripts will preprocess the above datasets and generate the prediction files used in our algorithms.

## Evaluation
Navigate to the `algorithms` folder. To evaluate our algorithms on the SBM, Facebook, EmailCore, and LastFM datasets, run `python main_general.py`. To evaluate them on the SBM datasets with $n=1200, 2400, 3600$ and the DBLP dataset, run `python main_binary.py`.
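
The order of the preprocessing and evaluation commands can be summarized in a small driver, sketched below. It is an illustration only and not a script shipped with this repository: the family labels `"general"` and `"binary"`, the `plan` and `run` functions, and the `dblp` flag are all hypothetical names. The folder and script names are the ones listed in this README. The driver does not remove the need to comment or uncomment dataset-specific sections inside the scripts.

```python
import subprocess
import sys
from pathlib import Path

# (folder, script) pairs in the order given in this README.
GENERAL_STEPS = [
    ("preprocessing_general", "preprocess.py"),
    ("preprocessing_general", "predict.py"),
    ("algorithms", "main_general.py"),
]
BINARY_STEPS = [
    ("preprocessing_binary", "process.py"),
    ("preprocessing_binary", "gen_edges.py"),
    ("preprocessing_binary", "predict.py"),
    ("algorithms", "main_binary.py"),
]
# Step 1 of the binary pipeline, needed for the DBLP dataset only.
DBLP_ONLY_STEP = ("preprocessing_binary", "sample_community_relations.py")


def plan(family, dblp=False):
    """Return the ordered (folder, script) steps for a dataset family.

    `family` is a hypothetical label: "general" or "binary".
    """
    if family == "general":
        return list(GENERAL_STEPS)
    if family == "binary":
        steps = list(BINARY_STEPS)
        if dblp:
            steps.insert(0, DBLP_ONLY_STEP)
        return steps
    raise ValueError(f"unknown dataset family: {family!r}")


def run(steps, repo_root=".", dry_run=True):
    """Print each step; execute it inside its folder unless dry_run is set."""
    for folder, script in steps:
        cwd = Path(repo_root) / folder
        print(f"[{cwd}] python {script}")
        if not dry_run:
            # check=True stops the pipeline at the first failing script.
            subprocess.run([sys.executable, script], cwd=cwd, check=True)
```

Calling `run(plan("binary", dblp=True))` prints the five steps for the DBLP dataset without executing anything; passing `dry_run=False` runs them from the repository root.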

## Notes
Some scripts or functions are intended to handle **only one dataset at a time**. Before running them, **comment or uncomment** the relevant sections to match the dataset you are using.