# Quantum (Inspired) $D^{2}$-Sampling with Applications
This repository contains the code used to run the experimental investigations. The `qid2` static library implements the _Sample and Query Access_ data structure as well as the `QI-k-means++` and `k-means++` seeding algorithms. The source code is in the `code` subdirectory, and `main.cpp` contains an example experiment for reference.

Compiler version used : g++ (GCC) 13.2.1

Please use a Linux-based OS.

## Example Usage

```cpp
/* main.cpp */
#include "qid2.h"

int main()
{
    /* provide a seed */
    rng_seed(42); 

    vector<double> v = {1,2,3,4};
    
    /* Instantiate data structure*/
    SQVec sqv ; 
    sqv.build(v); 

    /* Output a sample*/
    cout << sqv.sample() << endl;

    return 0;
}
```

To compile your file and link it against the library, run :

```g++ main.cpp -Icode -Lcode -lqid2 -o out```

```./out```

If you are working in another folder, replace `code` in `-Icode` and `-Lcode` with the path to the source code.

## Brief Documentation

### Seeding
To seed the random number generator, call `rng_seed(seed_val)`, where `seed_val` is an integer.

### Sample & Query Access Data Structure
#### Vectors
Constructor : `SQVec sqv`

Instantiate : `sqv.build(v)` where `v` is a `vector<double>`

Query : `sqv.get(idx)` where `idx` is an `int`. Returns a `double`

Sample : `sqv.sample()` returns a random index $i$ (`int`) with probability $\frac{v_{i}^{2}}{\| v \|^{2}}$

Norm : `sqv.norm2()` returns $\| v \|^{2}$ (`double`)
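
The sampling rule above can be illustrated with a small, self-contained sketch. It is not the `qid2` implementation: `SketchSQVec` and its members are hypothetical names, and the binary tree of partial sums of squares is one common way to support this kind of sampling, assumed here purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Illustrative stand-in for SQVec (hypothetical; not the qid2 source).
// Leaves of a complete binary tree hold v_i^2; each internal node holds
// the sum of its two children, so the root holds ||v||^2.
struct SketchSQVec {
    std::size_t leaves = 1;    // number of leaves: smallest power of two >= v.size()
    std::vector<double> vals;  // copy of the entries, used by get()
    std::vector<double> tree;  // tree[1] is the root; children of i are 2i and 2i+1

    void build(const std::vector<double>& v) {
        assert(!v.empty());
        vals = v;
        leaves = 1;
        while (leaves < v.size()) leaves *= 2;
        tree.assign(2 * leaves, 0.0);
        for (std::size_t i = 0; i < v.size(); ++i) tree[leaves + i] = v[i] * v[i];
        for (std::size_t i = leaves - 1; i >= 1; --i) tree[i] = tree[2 * i] + tree[2 * i + 1];
    }

    double get(std::size_t idx) const { return vals[idx]; }
    double norm2() const { return tree[1]; }

    // Returns index i with probability v_i^2 / ||v||^2 by walking from the
    // root to a leaf, choosing each child in proportion to its stored mass.
    template <class Rng>
    std::size_t sample(Rng& rng) const {
        assert(tree[1] > 0.0);  // sampling is undefined for the zero vector
        std::uniform_real_distribution<double> unif(0.0, tree[1]);
        double r = unif(rng);
        std::size_t node = 1;
        while (node < leaves) {
            const std::size_t left = 2 * node;
            // Never descend into an empty right subtree (guards against rounding).
            if (r < tree[left] || tree[left + 1] == 0.0) {
                node = left;
            } else {
                r -= tree[left];
                node = left + 1;
            }
        }
        return node - leaves;
    }
};
```

For the vector `{0, 3, 0, 4}`, `norm2()` is 25 and `sample(rng)` (with, say, a `std::mt19937 rng`) only ever returns index 1 or 3, the latter with probability 16/25. Each call to `sample` takes time logarithmic in the length of the vector.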

#### Matrices

Constructor : `SQMat A`

Instantiate : `A.build(a)` where `a` is a `vector<vector<double>>`

Get a vector of row norms : `A.row_vec` returns an `SQVec*`

Get a particular row : `A.rows[i]` returns the $i$th row as an `SQVec`
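
A typical way to use the two parts of a matrix structure like this is two-stage sampling: first pick a row in proportion to its squared norm, then pick an entry inside that row. The sketch below shows the idea on plain `std::vector` data. The names `row_norms2` and `sample_entry` are hypothetical helpers, not part of `qid2`, and `std::discrete_distribution` stands in for the library's own samplers.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Squared Euclidean norm of every row of a.
std::vector<double> row_norms2(const Matrix& a) {
    std::vector<double> out;
    out.reserve(a.size());
    for (const auto& row : a) {
        double s = 0.0;
        for (double x : row) s += x * x;
        out.push_back(s);
    }
    return out;
}

// Two-stage sampling (hypothetical helper, not part of qid2):
// pick row i with probability ||A_i||^2 / ||A||_F^2, then column j
// within that row with probability a_ij^2 / ||A_i||^2.
template <class Rng>
std::pair<std::size_t, std::size_t> sample_entry(const Matrix& a, Rng& rng) {
    const std::vector<double> w = row_norms2(a);
    double total = 0.0;
    for (double x : w) total += x;
    assert(total > 0.0);  // at least one non-zero entry is required

    std::discrete_distribution<std::size_t> pick_row(w.begin(), w.end());
    const std::size_t i = pick_row(rng);

    std::vector<double> sq;
    sq.reserve(a[i].size());
    for (double x : a[i]) sq.push_back(x * x);
    std::discrete_distribution<std::size_t> pick_col(sq.begin(), sq.end());
    return {i, pick_col(rng)};
}
```

Multiplying the two probabilities shows that entry $(i, j)$ is returned with probability $\frac{a_{ij}^{2}}{\| A \|_F^{2}}$.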


### QI-k-means++

Input : `V`, a pointer to the `SQMat` holding the data, and `k`, the number of clusters.

Output : `indices`, a `vector<int>` that stores the indices of the `k` centres in the original data.

#### Example
```cpp
vector<vector<double>> data = {{1,2,3,4},{3,2,1,4},{5,6,1,3},{4,1,2,3}};
SQMat A;
A.build(data);
vector<int> indices = QIkpp(&A,2);
```
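
For comparison, the classical `k-means++` seeding that serves as the baseline can be sketched in a few lines. This is a minimal illustration of textbook $D^{2}$-sampling on plain vectors, not the `qid2` code and not the quantum-inspired variant. The names `dist2` and `kpp_seed` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <random>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Squared Euclidean distance between two points of equal dimension.
double dist2(const std::vector<double>& x, const std::vector<double>& y) {
    double s = 0.0;
    for (std::size_t j = 0; j < x.size(); ++j) s += (x[j] - y[j]) * (x[j] - y[j]);
    return s;
}

// Classical k-means++ seeding (hypothetical helper, not the qid2 QIkpp):
// returns the indices of k rows of data chosen as initial centres.
template <class Rng>
std::vector<int> kpp_seed(const Matrix& data, int k, Rng& rng) {
    const int n = static_cast<int>(data.size());
    assert(k >= 1 && k <= n);
    std::vector<int> centres;
    // First centre: uniformly at random.
    std::uniform_int_distribution<int> unif(0, n - 1);
    centres.push_back(unif(rng));
    // d2[i] = squared distance from point i to its nearest chosen centre.
    std::vector<double> d2(n, std::numeric_limits<double>::infinity());
    while (static_cast<int>(centres.size()) < k) {
        double total = 0.0;
        for (int i = 0; i < n; ++i) {
            d2[i] = std::min(d2[i], dist2(data[i], data[centres.back()]));
            total += d2[i];
        }
        int next;
        if (total > 0.0) {
            // D^2-sampling: point i is chosen with probability d2[i] / total.
            std::discrete_distribution<int> pick(d2.begin(), d2.end());
            next = pick(rng);
        } else {
            next = unif(rng);  // every point already coincides with a centre
        }
        centres.push_back(next);
    }
    return centres;
}
```

Each round costs time proportional to the number of points times their dimension, because every distance is updated explicitly.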


## Running an Experiment


| k  | 2    | 3    | 4    | 5    | 6    | 7    | 8    | 9    | 10   |
|----|------|------|------|------|------|------|------|------|------|
| QI-k-means++ | 8.43 | 8.13 | 7.81 | 7.43 | 7.29 | 7.09 | 6.79 | 6.78 | 6.66 |
| k-means++    | 8.35 | 8.10 | 7.74 | 7.38 | 7.44 | 7.17 | 7.12 | 6.91 | 6.83 |

*Table 1: Clustering cost for binarized MNIST (costs are scaled down by a factor of $10^{6}$)*

| k  | 2      | 3      | 4      | 5      | 6      | 7      | 8      | 9      | 10     |
|----|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| QI-k-means++ | 538.82 | 253.41 | 112.53 | 96.04  | 83.72  | 64.92  | 58.62  | 56.25  | 49.92  |
| k-means++    | 773.26 | 227.50 | 115.33 | 93.45  | 83.21  | 60.14  | 57.98  | 53.99  | 50.35  |

*Table 2: Clustering cost for IRIS (costs are rounded to 2 decimal places)*

| k  | 2    | 3    | 4    | 5    | 6    | 7    | 8    | 9    | 10   |
|----|------|------|------|------|------|------|------|------|------|
| QI-k-means++ | 3.52 | 3.20 | 2.94 | 2.93 | 2.77 | 2.43 | 2.38 | 2.35 | 2.27 |
| k-means++    | 3.47 | 3.14 | 2.96 | 2.74 | 2.67 | 2.35 | 2.37 | 2.27 | 2.17 |

*Table 3: Clustering cost for DIGITS (costs are scaled down by a factor of $10^{6}$)*

| k  | 2    | 3    | 4    | 5    | 6    | 7    | 8    | 9    | 10   |
|----|------|------|------|------|------|------|------|------|------|
| QI-k-means++ | 1.89 | 3.19 | 0.97 | 1.36 | 1.44 | 1.24 | 0.17 | 0.20 | 0.47 |
| k-means++    | 1.93 | 3.29 | 1.90 | 1.44 | 1.57 | 1.09 | 0.73 | 0.32 | 0.52 |

*Table 4: Variance of clustering cost for binarized MNIST (values are scaled down by a factor of $10^{11}$)*

| k  | 2      | 3      | 4     | 5    | 6    | 7    | 8    | 9    | 10   |
|----|--------|--------|-------|------|------|------|------|------|------|
| QI-k-means++ | 246523 | 48805  | 335   | 830  | 236  | 99   | 72   | 55   | 51   |
| k-means++    | 423718 | 27544  | 420   | 561  | 765  | 54   | 93   | 35   | 47   |

*Table 5: Variance of Clustering cost for IRIS (values are rounded to 2 decimal places)*

| k  | 2    | 3    | 4    | 5    | 6    | 7    | 8    | 9    | 10   |
|----|------|------|------|------|------|------|------|------|------|
| QI-k-means++ | 3.77 | 4.05 | 1.44 | 5.96 | 1.10 | 2.37 | 1.10 | 0.57 | 1.07 |
| k-means++    | 1.91 | 2.00 | 2.24 | 3.92 | 2.28 | 0.38 | 1.34 | 1.51 | 1.06 |

*Table 6: Variance of clustering cost for DIGITS (values are scaled down by a factor of $10^{10}$)*
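
As a reading aid for the tables, the sketch below computes a clustering cost and its mean and variance over repeated runs. It assumes the usual k-means objective (sum of squared distances to the nearest centre) and the population variance. The source does not state which conventions were used, nor whether costs were measured directly after seeding, so these helpers (`clustering_cost`, `mean`, `variance`) are hypothetical and may differ from what `main.cpp` does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// k-means cost of a set of centres (given as row indices into data):
// sum over all points of the squared distance to the nearest centre.
double clustering_cost(const Matrix& data, const std::vector<int>& centres) {
    assert(!centres.empty());
    double cost = 0.0;
    for (const auto& x : data) {
        double best = std::numeric_limits<double>::infinity();
        for (int c : centres) {
            double s = 0.0;
            for (std::size_t j = 0; j < x.size(); ++j) {
                const double diff = x[j] - data[c][j];
                s += diff * diff;
            }
            best = std::min(best, s);
        }
        cost += best;
    }
    return cost;
}

// Arithmetic mean of the costs collected over several runs.
double mean(const std::vector<double>& xs) {
    assert(!xs.empty());
    double s = 0.0;
    for (double x : xs) s += x;
    return s / static_cast<double>(xs.size());
}

// Population variance (divides by the number of runs; an assumption).
double variance(const std::vector<double>& xs) {
    const double m = mean(xs);
    double s = 0.0;
    for (double x : xs) s += (x - m) * (x - m);
    return s / static_cast<double>(xs.size());
}
```

For the three points `{0,0}`, `{1,0}`, `{10,0}` with centres at indices 0 and 2, the cost is 1, since only the middle point is at non-zero distance from its nearest centre.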



For example, to run the experiment on the IRIS dataset, use the following commands :
```
g++ main.cpp -Icode -Lcode -lqid2 -o out

./out data/iris_data.txt iris_out.csv
```
The results are then written to `iris_out.csv`. To change the values of k or the number of repetitions, edit the code in `main.cpp`. The results reported in the tables above are also available as output files in the `data` folder, and the commands above reproduce such an output file. To use your own dataset, replace the path to the data accordingly. Simplified commands are provided in the makefile and can be run using `make test_iris` or `make test data=<path/to/data> output=<path/to/output>`, etc.

A prebuilt executable, `test`, is also provided. To use it, run `./test <path/to/data> <path/to/output>`.