# Improved Algorithms for Clustering with Distance Oracles

## Project Overview

This project studies clustering in settings where the exact distances between data points are not fully known. Instead of full access to the distance matrix, we assume access to two types of oracles:

- **Strong Oracle**: Provides the exact distance between any two points.
- **Weak Oracle**: Returns the correct distance with probability $1 - \delta$ and an arbitrary incorrect value with probability $\delta$, where $\delta$ is between 0 and 0.5.
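
The two oracles can be mimicked with a small toy class. This is a minimal sketch, not the project's code: the class name `DistanceOracles`, the uniform noise model for "arbitrary" values, the accepted range `0 <= delta < 0.5`, and the caching of weak answers (so repeated queries agree, as they would with a precomputed corrupted matrix) are all assumptions made for illustration.

```python
import math
import random


class DistanceOracles:
    """Toy strong/weak oracle pair over a fixed point set (illustrative only)."""

    def __init__(self, points, delta, seed=0):
        if not 0 <= delta < 0.5:
            raise ValueError("delta must lie in [0, 0.5)")
        self.points = points
        self.delta = delta
        self.rng = random.Random(seed)
        self.weak_cache = {}      # fixes each weak answer once it is drawn
        self.strong_queries = 0
        self.weak_queries = 0

    def strong(self, i, j):
        """Exact distance; every call is counted as a strong query."""
        self.strong_queries += 1
        return math.dist(self.points[i], self.points[j])

    def weak(self, i, j):
        """Exact distance with probability 1 - delta, otherwise a noisy value."""
        self.weak_queries += 1
        key = (min(i, j), max(i, j))
        if key not in self.weak_cache:
            true_d = math.dist(self.points[i], self.points[j])
            if self.rng.random() < self.delta:
                # Assumption: "arbitrary" is modelled as a uniform draw.
                self.weak_cache[key] = self.rng.uniform(0.0, 10.0 * (true_d + 1.0))
            else:
                self.weak_cache[key] = true_d
        return self.weak_cache[key]
```

The counters make it easy to report how many strong queries an algorithm spends, which is the quantity the project studies.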

### Goal

The main goal is to understand how many strong oracle queries are needed to find a good approximate solution to the following clustering problems:

- **$k$-means**
- **$k$-center**

### What the Project Includes

- Algorithms with theoretical guarantees on the number of strong oracle queries needed for both $k$-means and $k$-center clustering.
- Experimental validation of these guarantees.

### Experiments

The experiments are designed to test the algorithms in both real and synthetic settings:

- **Data**:
  - MNIST (real-world data)
  - SBM (synthetic data based on the Stochastic Block Model)

- **Setup**:
  - Corrupted distance matrices are created for various values of $\delta$.
  - The clustering algorithms are run with both oracles to measure the clustering cost and the number of strong oracle queries used.

These experiments help evaluate how well the algorithms perform under noisy distance information and how efficiently they use the strong oracle.




----------------------------

# Scripts

# `distance_mnist.py`

This script processes the MNIST dataset and creates corrupted distance matrices.

## Steps

1. Load and Scale the MNIST Dataset
   - Read data from `mnist_train.csv`.

2. Dimensionality Reduction
   - Apply SVD to reduce to 50 dimensions.
   - Apply t-SNE to reduce to 2 dimensions.
   - Save the reduced datasets as:
     - `images_svd_50d`
     - `images_tsne_2d`

3. Generate Corrupted Distance Matrices
   - Use the `generate_distance_matrix` function.
   - Generate corrupted distance matrices for both reduced datasets.
   - For each dataset, create matrices with $\delta$ values:
     - $\delta = 0.1$
     - $\delta = 0.2$
     - $\delta = 0.3$
   - Save each matrix in `.npy` format.
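
The corruption step can be sketched as follows. This is an illustration under stated assumptions, not the project's `generate_distance_matrix`: the function name `corrupt_distance_matrix`, the uniform noise on `[0, max distance]`, and the choice to corrupt each unordered pair once (keeping the matrix symmetric) are all hypothetical. It builds the full pairwise difference array in memory, so it is only suitable for small inputs.

```python
import numpy as np


def corrupt_distance_matrix(points, delta, seed=0):
    """Return (exact, corrupted) symmetric distance matrices for `points`."""
    rng = np.random.default_rng(seed)
    diff = points[:, None, :] - points[None, :, :]
    exact = np.sqrt((diff ** 2).sum(axis=-1))

    n = len(points)
    upper = np.triu_indices(n, k=1)              # each unordered pair once
    flip = rng.random(len(upper[0])) < delta     # pairs whose entry is corrupted
    # Assumption: corrupted entries are uniform on [0, max exact distance].
    noise = rng.uniform(0.0, exact.max(), size=len(upper[0]))

    corrupted = exact.copy()
    rows, cols = upper[0][flip], upper[1][flip]
    corrupted[rows, cols] = noise[flip]
    corrupted[cols, rows] = noise[flip]          # keep the matrix symmetric
    return exact, corrupted
```

Either returned array can then be written to disk with `np.save`.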

## Output Files

- `distance_matrix_tsne.npy`
- `distance_matrix_tsne_delta_{delta}.npy`
- `distance_matrix_svd_delta_{delta}.npy`




---------------------------

# `distance_sbm.py`

This script generates Stochastic Block Model (SBM) datasets and creates corrupted distance matrices for them.

## Steps

1. Generate SBM-based data:
   - Parameters: $k = 7$, $d = 7$, $n \in \{10{,}000, 20{,}000, 50{,}000\}$
   - Points in the $i$-th cluster are drawn from a Gaussian distribution $N(\mu^{(i)}, I)$
     - $\mu^{(i)}[i] = 10^5$
     - $\mu^{(i)}[j] = 0$ for $j \ne i$
   - Save each generated dataset

2. Generate corrupted distance matrices:
   - Use the `generate_distance_matrix` function
   - Corruption levels: 
     - $\delta = 0.1$
     - $\delta = 0.2$
     - $\delta = 0.3$
   - For each value of $n \in \{10{,}000, 20{,}000, 50{,}000\}$
   - Save each corrupted matrix in `.npy` format
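
The data model in step 1 can be sketched in a few lines of NumPy. The function name `generate_sbm_points` and the uniform assignment of points to clusters are assumptions for illustration; the means and unit covariance follow the description above.

```python
import numpy as np


def generate_sbm_points(n, k=7, d=7, scale=1e5, seed=0):
    """Draw n points from k unit-covariance Gaussians with means scale * e_i."""
    if k > d:
        raise ValueError("need k <= d so every cluster gets its own coordinate")
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=n)        # assumption: uniform cluster sizes
    means = np.zeros((k, d))
    means[np.arange(k), np.arange(k)] = scale  # mu^(i)[i] = 10^5, zero elsewhere
    points = means[labels] + rng.standard_normal((n, d))
    return points, labels
```

Because the means are $10^5$ apart while the noise has unit variance, the clusters are extremely well separated, and the largest coordinate of a point identifies its cluster.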

## Output Files

- `dataset_{total_points}.npy`
- `distance_matrix_{total_points}_delta_{delta}.npy`


-----------------------



# `mnist_k_center.py`

This script runs the weak-greedy ball carving algorithm for the $k$-center problem on the MNIST dataset using corrupted distance matrices.

## Steps

1. Load the datasets:
   - `images_svd_50d`
   - `images_tsne_2d`

2. Load the corrupted distance matrix:
   - For a given corruption level $\delta$, load the corresponding corrupted distance matrix.

3. Compute the strong baseline:
   - Use the `farthest_point_traversal_kcenter` function.
   - Compute the cost of the strong baseline.

4. Compute $k$-center solution using weak and strong oracle:
   - Use the `k_center_with_oracle` function.
   - This function takes a parameter `S_size`.
   - Vary `S_size` until the function returns a valid result without errors.
   - It outputs:
     - Number of strong oracle queries
     - Cost of the $k$-center solution

5. Run for different settings:
   - For each dataset and $\delta$ value, run the procedure to evaluate:
     - Number of strong oracle queries
     - Cost of the $k$-center solution
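
For reference, the classical farthest-point traversal on an exact distance matrix looks roughly like this. It is a generic sketch of the well-known greedy procedure (a 2-approximation for $k$-center when distances are exact), not the repository's `farthest_point_traversal_kcenter`; the name `farthest_point_traversal` and the `first` parameter are hypothetical.

```python
import numpy as np


def farthest_point_traversal(dist, k, first=0):
    """Greedy k-center on an exact distance matrix; returns (centers, radius)."""
    centers = [first]
    closest = dist[first].copy()           # distance of each point to nearest center
    for _ in range(k - 1):
        nxt = int(np.argmax(closest))      # farthest point from current centers
        centers.append(nxt)
        closest = np.minimum(closest, dist[nxt])
    return centers, float(closest.max())
```

Every entry of `dist` that the loop reads corresponds to one strong oracle query, which is why this baseline is expensive in the oracle model.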

## Key Functions

- `farthest_point_traversal_kcenter`: Computes the strong $k$-center baseline.
- `k_center_with_oracle`: Implements the weak-greedy ball carving algorithm with corrupted distance matrices.
- `sample_initial_centers`, `perturbed_closest_min`: Helper functions used inside the function `farthest_point_traversal_kcenter`.

## Output

- Average clustering cost: reported as `4 * R`
- Oracle query counts:
  - Average number of strong queries
  - Average number of weak queries


----------------------------


# `sbm_k_centers.py`

This script computes the number of strong oracle queries and the $k$-center cost for SBM-based datasets. It behaves similarly to `mnist_k_center.py`, but uses SBM datasets and their corresponding corrupted distance matrices.

## Steps

1. Load the SBM dataset:
   - Choose from datasets with $n \in \{10{,}000, 20{,}000, 50{,}000\}$

2. Load the corrupted distance matrix:
   - For a given $\delta$ value, load the corresponding distance matrix

3. Compute the strong baseline:
   - Use the `farthest_point_traversal_kcenter` function
   - Compute the cost of the strong baseline

4. Compute $k$-center solution with weak and strong oracle:
   - Use the `k_center_with_oracle` function
   - Vary the `S_size` parameter until a valid result is returned
   - Record:
     - Number of strong oracle queries
     - Cost of the $k$-center solution

5. Repeat the process:
   - For each value of $n$ and each $\delta \in \{0.1, 0.2, 0.3\}$
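
Whichever oracle was used to pick the centers, the recorded cost is only meaningful if it is measured against exact distances. The helpers below are a hedged sketch of that evaluation; the names `k_center_cost` and `assign_to_centers` are hypothetical and do not come from the scripts.

```python
import numpy as np


def k_center_cost(dist, centers):
    """Largest exact distance from any point to its nearest chosen center."""
    nearest = dist[:, centers].min(axis=1)
    return float(nearest.max())


def assign_to_centers(dist, centers):
    """Index (into `centers`) of the nearest center for every point."""
    return dist[:, centers].argmin(axis=1)
```

Here `dist` is assumed to be the uncorrupted distance matrix and `centers` a list of point indices.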

## Key Functions

- `farthest_point_traversal_kcenter`: Computes the strong $k$-center baseline
- `k_center_with_oracle`: Implements weak-greedy ball carving using oracle decisions
- `sample_initial_centers`, `perturbed_closest_min`: Helper functions used within the oracle logic

## Output

- Average clustering cost: reported as `4 * R`
- Oracle query counts:
  - Average number of strong queries
  - Average number of weak queries
    

------------------------------------



# `tsne_kmeans.py`

This script performs k-means clustering on MNIST-based datasets using both strong and weak oracles. It evaluates clustering cost and strong oracle usage under various levels of corruption in the distance matrix.

## Steps

1. Load the datasets:
   - `images_svd_50d`
   - `images_tsne_2d`

2. Load the corrupted distance matrix:
   - For a given value of $\delta$, load the corresponding corrupted distance matrix.

3. Compute the strong baseline:
   - Use the function `initialize_kmeans_plus_plus` to initialize centers.
   - Compute the clustering cost using the function `compute_cost`.

4. Compute the weak baseline:
   - Use the function `weak_initialize_kmeans_plus_plus` to initialize centers.
   - Compute the clustering cost using `compute_cost`.

5. Compute the k-means solution using weak and strong oracles:
   - Use the function `perturbed_k_means_plus_plus`.
   - This function processes data in batches of size $20{,}000$.
   - Uses a thread pool of $64$ threads to execute tasks concurrently.
   - Has a variable $t$, which controls the number of strong oracle queries.
   - Returns a set of centers stored in `center_weights`.
   - Uses the following helper functions:
     - `compute_single_perturbed_median`
     - `compute_initial_closest_centers`
     - `update_closest_centers`

6. Create weighted instances:
   - Use the function `perturbed_cost`.
   - Takes `center_weights` as input and assigns weights to each center.

7. Run weighted k-means++:
   - Use the function `weighted_kmeans`.
     - Input: `center_weights`
     - Output: `selected_centers`, the final set of $k$ centers.

8. Final output procedure:
   - After completing steps 1–4, call `perturbed_k_means_plus_plus` with the dataset and corrupted distance matrix.
   - This returns:
     - `center_weights`
     - `no_so_point_queries` — the number of strong oracle queries used
   - Call `weighted_kmeans` with `center_weights` to get `selected_centers`.
   - Compute the final k-means cost using `compute_cost` with the dataset, `selected_centers`, and $k$.

9. Experiment with different values of `t`:
   - Multiply `t` by different constants to observe how the cost and number of strong oracle queries change.

10. Run for all settings:
    - Repeat the process for each dataset (`images_svd_50d`, `images_tsne_2d`) and for each $\delta$ value.
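
The strong baseline in step 3 relies on k-means++ seeding followed by a cost evaluation. A generic sketch of both pieces is given below; it uses exact Euclidean distances computed from the points, and the names `kmeanspp_seed` and `kmeans_cost` are hypothetical stand-ins rather than the script's `initialize_kmeans_plus_plus` and `compute_cost`.

```python
import numpy as np


def kmeanspp_seed(points, k, seed=0):
    """Pick k centers by D^2 sampling using exact squared Euclidean distances."""
    rng = np.random.default_rng(seed)
    n = len(points)
    centers = [int(rng.integers(n))]
    d2 = ((points - points[centers[0]]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        nxt = int(rng.choice(n, p=d2 / d2.sum()))   # sample proportional to D^2
        centers.append(nxt)
        d2 = np.minimum(d2, ((points - points[nxt]) ** 2).sum(axis=1))
    return centers


def kmeans_cost(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    diff = points[:, None, :] - points[centers][None, :, :]
    return float((diff ** 2).sum(axis=-1).min(axis=1).sum())
```

A weak-oracle variant would replace the exact squared distances in `d2` with values read from the corrupted matrix, which is where the noise enters the sampling probabilities.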



------------------------------------------
# `sbm_k_means.py`

This script computes the number of strong oracle queries and the $k$-means cost for SBM-based datasets. It behaves similarly to `tsne_kmeans.py`, but uses SBM datasets and their corresponding corrupted distance matrices.

## Steps



1. Load the SBM dataset:
   - Choose from datasets with $n \in \{10{,}000, 20{,}000, 50{,}000\}$

2. Load the corrupted distance matrix:
   - For a given $\delta$ value, load the corresponding distance matrix

3. Compute the strong baseline:
   - Use the function `initialize_kmeans_plus_plus` to initialize centers.
   - Compute the clustering cost using the function `compute_cost`.

4. Compute the weak baseline:
   - Use the function `weak_initialize_kmeans_plus_plus` to initialize centers.
   - Compute the clustering cost using `compute_cost`.

5. Compute the k-means solution using weak and strong oracles:
   - Use the function `perturbed_k_means_plus_plus`.
   - This function processes data in batches of size $20{,}000$.
   - Uses a thread pool of $64$ threads to execute tasks concurrently.
   - Has a variable $t$, which controls the number of strong oracle queries.
   - Returns a set of centers stored in `center_weights`.
   - Uses the following helper functions:
     - `compute_single_perturbed_median`
     - `compute_initial_closest_centers`
     - `update_closest_centers`

6. Create weighted instances:
   - Use the function `perturbed_cost`.
   - Takes `center_weights` as input and assigns weights to each center.

7. Run weighted k-means++:
   - Use the function `weighted_kmeans`.
     - Input: `center_weights`
     - Output: `selected_centers`, the final set of $k$ centers.

8. Final output procedure:
   - After completing steps 1–4, call `perturbed_k_means_plus_plus` with the dataset and corrupted distance matrix.
   - This returns:
     - `center_weights`
     - `no_so_point_queries` — the number of strong oracle queries used
   - Call `weighted_kmeans` with `center_weights` to get `selected_centers`.
   - Compute the final k-means cost using `compute_cost` with the dataset, `selected_centers`, and $k$.

9. Experiment with different values of `t`:
   - Multiply `t` by different constants to observe how the cost and number of strong oracle queries change.

10. Run for all settings:
    - Repeat the process for each SBM-based dataset with $n \in \{10{,}000, 20{,}000, 50{,}000\}$ and for each $\delta$ value.
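
Steps 6 and 7 reduce the problem to clustering a small weighted set of candidate centers. One way to do that is weighted $D^2$ seeding followed by weighted Lloyd iterations, sketched below. This is an assumption about the general technique, not the script's `weighted_kmeans`: the function name, the representation of `center_weights` as separate `candidates` and `weights` arrays, and the fixed iteration count are all hypothetical, and the weights are assumed to be strictly positive.

```python
import numpy as np


def weighted_kmeans_sketch(candidates, weights, k, iters=20, seed=0):
    """Weighted D^2 seeding followed by weighted Lloyd iterations."""
    rng = np.random.default_rng(seed)
    n = len(candidates)
    # Seeding: sample proportionally to weight * squared distance to chosen centers.
    idx = [int(rng.choice(n, p=weights / weights.sum()))]
    d2 = ((candidates - candidates[idx[0]]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        p = weights * d2
        idx.append(int(rng.choice(n, p=p / p.sum())))
        d2 = np.minimum(d2, ((candidates - candidates[idx[-1]]) ** 2).sum(axis=1))
    centers = candidates[idx].astype(float)

    for _ in range(iters):
        diff = candidates[:, None, :] - centers[None, :, :]
        assign = (diff ** 2).sum(axis=-1).argmin(axis=1)
        for c in range(k):
            mask = assign == c
            if mask.any():   # leave an empty cluster's center where it is
                centers[c] = np.average(candidates[mask], axis=0, weights=weights[mask])
    return centers
```

Note that this sketch returns centroid coordinates, whereas a variant restricted to choosing among the candidates themselves would return indices instead.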




-------------------------------------------

## Requirements

- Python 3.x
- Required packages:
  - numpy
  - pandas
  - matplotlib
  - seaborn
  - scikit-learn
  - tqdm

--------------------------------------------


