# ARDG

This repository contains the source code for our paper: **Identify Dominators: The Key To Improve Large-Scale Maximum Inner Product Search**.

## 1 Abstract

We propose a novel approach that directly exploits the geometry of the inner product (IP) space by focusing on special vectors called dominators. Building on them, we introduce the Monotonic Relative Dominator Graph (MRDG), an IP-space-native, sparse, and strongly connected graph designed for efficient maximum inner product search (MIPS) with strong theoretical foundations. To ensure scalability, we further introduce the Approximate Relative Dominator Graph (ARDG), which retains MRDG’s benefits while significantly reducing indexing complexity.

## 2 Competitors

* ip-NSW ([Paper](https://proceedings.neurips.cc/paper_files/paper/2018/file/229754d7799160502a143a72f6789927-Paper.pdf)): A graph-based method that uses an inner-product navigable small world graph.
* ip-NSW+ ([Paper](https://aaai.org/ojs/index.php/AAAI/article/view/5344/5200)): An enhancement of ip-NSW that introduces an additional angular proximity graph.
* Möbius-Graph ([Paper](https://proceedings.neurips.cc/paper/2019/file/0fd7e4f42a8b4b4ef33394d35212b13e-Paper.pdf)): A graph-based method that reduces the MIPS problem to an NNS problem using the Möbius transformation. Since the original code is not available, we implemented a version based on the paper.
* IPDG ([Paper](https://aclanthology.org/D19-1527.pdf)): A graph-based method that focuses on the top-1 MIPS problem and applies a heuristic pruning strategy by identifying extreme points, similar to the dominator definition in our work.
* NAPG ([Paper](https://dl.acm.org/doi/abs/10.1145/3447548.3467412)): A recent graph-based method claiming state-of-the-art performance by improving ip-NSW with a specialized metric, using an adaptive $\alpha$ for different norm distributions.
* Fargo ([Paper](https://www.vldb.org/pvldb/vol16/p1100-zheng.pdf)): The latest state-of-the-art LSH-based method with theoretical guarantees.
* ScaNN ([Paper](http://proceedings.mlr.press/v119/guo20h/guo20h.pdf)): The state-of-the-art quantization method.

## 3 Datasets

The data format is: number of vectors (n) × dimension (d).


| Dataset                                                      | Base Size   | Dim  | Query Size | Modality   |
| ------------------------------------------------------------ | ----------- | ---- | ---------- | ---------- |
| Netflix ([link](https://github.com/xinyandai/similarity-search/tree/mipsex/data/netflix)) | 17,770      | 300  | 1,000      | Video      |
| MNIST ([link](https://yann.lecun.com/exdb/mnist/index.html)) | 60,000      | 784  | 10,000     | Image      |
| YahooMusic ([link](https://www.cse.cuhk.edu.hk/systems/hash/gqr/dataset/yahoomusic.tar.gz)) | 136,736    | 300 | 1,000      | Audio       |
| UKBench ([link](https://www.cse.cuhk.edu.hk/systems/hash/gqr/dataset/ukbench.tar.gz)) | 1,097,907   | 128 | 1,000      | Image       |
| Music100 ([link](https://github.com/stanis-morozov/ip-nsw))  | 1,000,000   | 100  | 10,000     | Audio      |
| Text2Image1M ([link](https://research.yandex.com/blog/benchmarks-for-billion-scale-similarity-search)) | 1,000,000   | 200  | 100,000    | Multi  |
| MNIST1M ([link](https://leon.bottou.org/projects/infimnist)) | 1,000,000  | 784  | 10,000    | Image      |
| Deep10M        ([link](https://research.yandex.com/blog/benchmarks-for-billion-scale-similarity-search))  | 10,000,000 | 96   | 10,000     | Image |
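
To make the n × d layout and the MIPS objective concrete, here is a minimal brute-force sketch. It assumes the vectors are held in memory row by row (row-major) as 32-bit floats; the function names `ip_score` and `brute_force_mips` are illustrative and are not part of this repository. A linear scan like this is one way to produce exact ground truth on the smaller datasets.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Inner product of two d-dimensional vectors.
float ip_score(const float* a, const float* b, std::size_t d) {
    float s = 0.0f;
    for (std::size_t i = 0; i < d; ++i) s += a[i] * b[i];
    return s;
}

// Exact top-k MIPS by linear scan (illustrative helper, not the repo's API).
// `base` is assumed to hold n vectors of dimension d, stored row by row.
// Returns the ids of the k vectors with the largest inner product, best first.
std::vector<std::size_t> brute_force_mips(const std::vector<float>& base,
                                          std::size_t n, std::size_t d,
                                          const float* query, std::size_t k) {
    std::vector<float> score(n);
    for (std::size_t i = 0; i < n; ++i)
        score[i] = ip_score(base.data() + i * d, query, d);

    std::vector<std::size_t> ids(n);
    std::iota(ids.begin(), ids.end(), std::size_t{0});
    k = std::min(k, n);
    std::partial_sort(ids.begin(), ids.begin() + static_cast<std::ptrdiff_t>(k),
                      ids.end(),
                      [&](std::size_t a, std::size_t b) { return score[a] > score[b]; });
    ids.resize(k);
    return ids;
}
```

The scan costs O(n · d) per query, which is the baseline that graph indexes such as ARDG aim to beat.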

## 4 Build Instructions

### Prerequisites

- GCC 4.9+ with OpenMP
- CMake 2.8+
- Boost 1.55+

### Compile on Linux

```shell
$ mkdir build/ && cd build/
$ cmake ..
$ make -j
```

## 5 Usage

### Code Structure

- **datasets**: dataset files
- **include**: C++ class interfaces and the main function implementations
- **script**: scripts for running the experiments
- **test**: test code
- **evaluation**: recall evaluation

### How to use

#### Step 1. Build Base Graph

First, prepare a base graph. You can build it with Faiss or another library.
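
For a small dataset, a base graph can also be computed exactly by brute force. The sketch below builds a k-neighbor graph in memory, assuming row-major float data and inner product as the similarity. Both choices are assumptions: this README does not state which similarity the base graph should use or which file format `test_dsg` expects, so the function name `build_knn_graph` and its output layout are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Builds an exact k-neighbor graph (illustrative, not the repo's builder):
// for every base vector, the ids of the k other vectors with the largest
// inner product. Costs O(n^2 * d), so it only suits small datasets.
// `base` is assumed to hold n vectors of dimension d, stored row by row.
std::vector<std::vector<std::size_t>> build_knn_graph(
        const std::vector<float>& base, std::size_t n, std::size_t d,
        std::size_t k) {
    std::vector<std::vector<std::size_t>> graph(n);
    if (n == 0) return graph;
    k = std::min(k, n - 1);  // a vector is never its own neighbor

    std::vector<float> score(n);
    std::vector<std::size_t> ids;
    for (std::size_t u = 0; u < n; ++u) {
        ids.clear();
        for (std::size_t v = 0; v < n; ++v) {
            if (v == u) continue;
            float s = 0.0f;
            for (std::size_t j = 0; j < d; ++j)
                s += base[u * d + j] * base[v * d + j];
            score[v] = s;
            ids.push_back(v);
        }
        auto mid = ids.begin() + static_cast<std::ptrdiff_t>(k);
        // Move the k highest-scoring candidates to the front, best first.
        std::partial_sort(ids.begin(), mid, ids.end(),
                          [&](std::size_t a, std::size_t b) { return score[a] > score[b]; });
        graph[u].assign(ids.begin(), mid);
    }
    return graph;
}
```

For million-scale data, an approximate builder from a library such as Faiss is the practical choice; the result then needs to be written in whatever format the indexing step reads.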

#### Step 2. Indexing

```shell
./test/test_dsg MODE DATA_PATH BASE_GRAPH_PATH ARDG_PATH DATA_NUM M DIM EFC L THRESHOLD ANGLE
```

- `DATA_PATH` is the path of the base data in `bin` format.
- `BASE_GRAPH_PATH` is the path of the base graph built in *Step 1*.
- `ARDG_PATH` is the path of the generated ARDG index.
- `DATA_NUM` is the number of base vectors.
- `M` is the maximum out-degree.
- `DIM` is the dimension of the dataset.
- `EFC` is the candidate pool size.
- `L` is the degree of the base graph.
- `ANGLE` is the minimal angle between edges.


#### Step 3. Searching

```shell
./test/test_dsg MODE DATA_PATH QUERY_PATH ARDG_PATH DATA_NUM QUERY_NUM K DIM search_L RESULT_PATH
```

- `DATA_PATH` is the path of the base data in `bin` format.
- `QUERY_PATH` is the path of the query data in `bin` format.
- `ARDG_PATH` is the path of the generated ARDG index.
- `DATA_NUM` is the number of base vectors.
- `QUERY_NUM` is the number of query vectors.
- `K` is the result size.
- `DIM` is the dimension of the dataset.
- `search_L` is the search pool size. Larger values give better results but slower search (must be larger than `K`).
- `RESULT_PATH` is the path of the result neighbors.
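
The trade-off behind `search_L` can be illustrated with a generic best-first search over a graph index. This is a minimal sketch of the technique commonly used by graph-based methods, not the search routine implemented in this repository. The function name `beam_search_mips`, the in-memory graph layout, and the `entry` start node are assumptions made for illustration; `pool_size` plays the role of `search_L`.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Best-first search for the k largest inner products (illustrative sketch).
// `graph[u]` lists the out-neighbors of node u; `base` is assumed row-major
// with dimension d. The pool keeps at most `pool_size` candidates.
std::vector<std::size_t> beam_search_mips(
        const std::vector<std::vector<std::size_t>>& graph,
        const std::vector<float>& base, std::size_t d,
        const float* query, std::size_t entry,
        std::size_t k, std::size_t pool_size) {
    struct Cand { float score; std::size_t id; bool expanded; };
    auto score_of = [&](std::size_t id) {
        float s = 0.0f;
        for (std::size_t j = 0; j < d; ++j) s += base[id * d + j] * query[j];
        return s;
    };
    auto better = [](const Cand& a, const Cand& b) { return a.score > b.score; };

    std::vector<bool> visited(graph.size(), false);
    std::vector<Cand> pool;  // sorted by descending score
    pool.push_back({score_of(entry), entry, false});
    visited[entry] = true;

    while (true) {
        // Take the best candidate whose neighbors have not been explored yet.
        auto it = std::find_if(pool.begin(), pool.end(),
                               [](const Cand& c) { return !c.expanded; });
        if (it == pool.end()) break;
        it->expanded = true;
        const std::size_t u = it->id;  // copy: inserts below invalidate `it`
        for (std::size_t v : graph[u]) {
            if (visited[v]) continue;
            visited[v] = true;
            Cand c{score_of(v), v, false};
            pool.insert(std::upper_bound(pool.begin(), pool.end(), c, better), c);
            if (pool.size() > pool_size) pool.pop_back();  // drop the worst
        }
    }

    std::vector<std::size_t> result;
    for (std::size_t i = 0; i < pool.size() && i < k; ++i)
        result.push_back(pool[i].id);
    return result;
}
```

A larger pool lets the search keep weaker candidates around for longer, which raises recall at the cost of more inner product computations. The pool must hold at least `k` entries for `k` results to be returned.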


## 6 Performance

### Evaluation Metrics

- QPS and the number of distance computations (for graph-based methods)

![evaluation](./evaluation.png)
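
For readers reproducing the numbers, the sketch below shows how recall@k and QPS are commonly computed. It assumes the usual definitions (recall@k as the fraction of the exact top-k ids that appear in the returned top-k, averaged over queries; QPS as queries divided by wall-clock seconds). The code in the `evaluation` directory may differ in detail, and the function names here are illustrative.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// recall@k averaged over all queries (illustrative helper).
// `result[q]` holds the returned ids for query q, `truth[q]` the exact ids,
// both best first. Both outer vectors are assumed to have the same size.
double recall_at_k(const std::vector<std::vector<std::size_t>>& result,
                   const std::vector<std::vector<std::size_t>>& truth,
                   std::size_t k) {
    if (result.empty() || k == 0) return 0.0;
    std::size_t hits = 0;
    for (std::size_t q = 0; q < result.size(); ++q) {
        const std::size_t kt = std::min(k, truth[q].size());
        const std::size_t kr = std::min(k, result[q].size());
        auto first = result[q].begin();
        auto last = first + static_cast<std::ptrdiff_t>(kr);
        for (std::size_t i = 0; i < kt; ++i)
            if (std::find(first, last, truth[q][i]) != last) ++hits;
    }
    return static_cast<double>(hits) /
           static_cast<double>(result.size() * k);
}

// Queries per second from a query count and elapsed wall-clock seconds.
double queries_per_second(std::size_t num_queries, double seconds) {
    return seconds > 0.0 ? static_cast<double>(num_queries) / seconds : 0.0;
}
```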