Summary: This tutorial provides implementation details for different FAISS index types and their optimal use cases in vector similarity search. It covers code implementations for Flat, IVF, HNSW, LSH, Scalar Quantizer, and Product Quantizer indexes, including specific parameter configurations and initialization syntax. The tutorial helps with tasks like choosing and implementing the right index type based on dataset size, memory constraints, and speed requirements. Key features include index-specific parameter tuning, performance trade-offs between speed, memory, and accuracy, and a recall evaluation function, making it valuable for building efficient vector search systems.

# Choosing Index

Given the large number of indexes and quantizers available, how do you choose one for your experiment or application? This part gives general suggestions for choosing the index that fits your needs.

## 0. Preparation

### Packages

For CPU usage, run:


```python
# %pip install -U faiss-cpu numpy h5py
```

For GPU usage on a Linux x86_64 system, use Conda:

```conda install -c pytorch -c nvidia faiss-gpu=1.8.0```


```python
from urllib.request import urlretrieve
import h5py
import faiss
import numpy as np
```

### Dataset

In this tutorial, we'll use [SIFT1M](http://corpus-texmex.irisa.fr/), a very popular dataset for ANN evaluation, to compare the indexes.

Run the following cell to download the dataset, or download it manually from the [ann-benchmarks](https://github.com/erikbern/ann-benchmarks?tab=readme-ov-file#data-sets) repo.


```python
data_url = "http://ann-benchmarks.com/sift-128-euclidean.hdf5"
destination = "data.hdf5"
urlretrieve(data_url, destination)
```

Then load the data from the hdf5 file.


```python
with h5py.File('data.hdf5', 'r') as f:
    corpus = f['train'][:]
    query = f['test'][:]

print(corpus.shape, corpus.dtype)
print(query.shape, query.dtype)
```


```python
d = corpus[0].shape[0]
k = 100
```

### Helper function

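The following helper function computes recall: for each query, the fraction of returned ids that also appear in the ground truth, averaged over all queries.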


```python
# compute recall from the prediction results and ground truth
def compute_recall(res, truth):
    recall = 0
    for i in range(len(res)):
        intersect = np.intersect1d(res[i], truth[i])
        recall += len(intersect) / len(res[i])
    recall /= len(res)

    return recall
```
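
As a quick sanity check, the same computation can be tried on a toy example. This is a minimal sketch: the function is restated (under the hypothetical name `recall_at_k`) so the cell runs on its own, and the ids are made up for illustration.

```python
import numpy as np

def recall_at_k(res, truth):
    """Mean fraction of returned ids that also appear in the ground truth."""
    hits = [len(np.intersect1d(r, t)) / len(r) for r, t in zip(res, truth)]
    return float(np.mean(hits))

# toy data: 2 queries, 4 results each (ids are made up)
toy_res = np.array([[0, 1, 2, 3], [4, 5, 6, 7]])
toy_truth = np.array([[0, 1, 9, 8], [4, 5, 6, 7]])

toy_recall = recall_at_k(toy_res, toy_truth)
print(toy_recall)  # 0.75, i.e. (2/4 + 4/4) / 2
```

The first query recovers 2 of 4 true neighbors and the second recovers all 4, so the average is 0.75.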

## 1. Flat Index

A Flat index uses brute force to search the neighbors of each query. It guarantees the optimal result with 100% recall, so we use its result as the ground truth.


```python
%%time
index = faiss.IndexFlatL2(d)
index.add(corpus)
```


```python
%%time
D, I_truth = index.search(query, k)
```
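
To make "brute force" concrete, here is a minimal NumPy sketch of an exhaustive search on a small synthetic corpus (not SIFT1M). It compares every query with every corpus vector using squared L2 distance, which is also what `IndexFlatL2` reports. The names `xb`, `xq`, and `topk` are illustrative, and the sketch makes no attempt to match FAISS's optimized implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
xb = rng.random((1000, 16), dtype=np.float32)   # toy corpus
xq = xb[:5].copy()                               # queries copied from the corpus
topk = 3

# squared L2 distance between every query and every corpus vector
diff = xq[:, None, :] - xb[None, :, :]           # shape (5, 1000, 16)
dists = (diff ** 2).sum(axis=2)                  # shape (5, 1000)

# ids of the topk closest corpus vectors for each query
ids = np.argsort(dists, axis=1)[:, :topk]
print(ids[:, 0])  # each query's nearest neighbor is its own copy in the corpus
```

The cost grows linearly with the corpus size, which is why the approximate indexes in the following sections trade a little recall for speed or memory.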

## 2. IVF Index


```python
%%time
nlist = 5
nprob = 3

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.nprobe = nprob

index.train(corpus)
index.add(corpus)
```


```python
%%time
D, I = index.search(query, k)
```


```python
recall = compute_recall(I, I_truth)
print(f"Recall: {recall}")
```

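Here `nlist` is the number of cells the corpus is partitioned into, and `nprobe` is the number of cells visited for each query. From the test we can see that the IVF Flat index noticeably improves search speed with only a very small loss of recall.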

## 3. HNSW Index


```python
%%time
M = 64
ef_search = 32
ef_construction = 64

index = faiss.IndexHNSWFlat(d, M)
# efConstruction must be set before adding data; efSearch only affects search
index.hnsw.efConstruction = ef_construction
index.hnsw.efSearch = ef_search

index.add(corpus)
```


```python
%%time
D, I = index.search(query, k)
```


```python
recall = compute_recall(I, I_truth)
print(f"Recall: {recall}")
```

With a search time of less than 1 second, we can see why HNSW is one of the best choices when extreme search speed is required. The reduction in recall is acceptable, but the longer index construction time and the large memory footprint need to be considered.
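
The memory footprint can be estimated with simple arithmetic. The sketch below assumes the rule of thumb from the FAISS index selection guidelines, roughly `d * 4 + M * 2 * 4` bytes per vector for `IndexHNSWFlat`. Treat the result as an approximation, since it ignores the upper graph layers and other overhead.

```python
n_vectors = 1_000_000   # SIFT1M corpus size
dim = 128               # vector dimension
hnsw_M = 64             # same M as in the cell above

flat_bytes = n_vectors * dim * 4                      # float32 vectors only
hnsw_bytes = n_vectors * (dim * 4 + hnsw_M * 2 * 4)   # vectors plus about 2*M int32 links each

print(f"Flat: {flat_bytes / 1e6:.0f} MB")
print(f"HNSW: {hnsw_bytes / 1e6:.0f} MB")
```

With these settings the graph links take about as much space as the vectors themselves, so the index is roughly twice the size of the Flat index.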

## 4. LSH


```python
%%time
nbits = d * 8

index = faiss.IndexLSH(d, nbits)
index.train(corpus)
index.add(corpus)
```


```python
%%time
D, I = index.search(query, k)
```


```python
recall = compute_recall(I, I_truth)
print(f"Recall: {recall}")
```

As we covered in the last notebook, LSH is not a good choice when the data dimension is large. Here, 128 dimensions is already a burden for LSH. As we can see, even with a relatively small `nbits` of d * 8, the index creation time and search time are still pretty long, and the recall of about 58.6% is not satisfactory.
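
To illustrate what `nbits` controls, here is a simplified NumPy sketch of the idea behind LSH with random hyperplanes: each vector is reduced to one bit per hyperplane, and neighbors are ranked by Hamming distance between bit codes. This is not FAISS's exact implementation, and all names (`planes`, `encode`, `n_bits`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_bits = 16, 64
xb = rng.standard_normal((500, dim)).astype(np.float32)   # toy corpus
xq = xb[:3].copy()                                         # queries copied from the corpus

# one random hyperplane per bit
planes = rng.standard_normal((dim, n_bits)).astype(np.float32)

def encode(x):
    # each bit records which side of a hyperplane the vector falls on
    return (x @ planes) > 0

codes_b, codes_q = encode(xb), encode(xq)

# Hamming distance = number of differing bits between two codes
hamming = (codes_q[:, None, :] != codes_b[None, :, :]).sum(axis=2)
nearest = hamming.argmin(axis=1)
print(nearest)
```

More bits give a finer approximation of the original distances, but every extra bit adds to both the code size and the comparison cost, which is why high-dimensional data quickly becomes expensive.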

## 5. Scalar Quantizer Index


```python
%%time
qtype = faiss.ScalarQuantizer.QT_8bit
metric = faiss.METRIC_L2

index = faiss.IndexScalarQuantizer(d, qtype, metric)
index.train(corpus)
index.add(corpus)
```


```python
%%time
D, I = index.search(query, k)
```


```python
recall = compute_recall(I, I_truth)
print(f"Recall: {recall}")
```

Here the scalar quantizer index's performance looks very similar to that of the Flat index, because the elements of the vectors in the SIFT dataset are integers in the range [0, 218]. The index therefore does not lose too much information during scalar quantization. For datasets with a more complex distribution in float32, the difference will be more obvious.
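
The small information loss can be checked with a simplified NumPy sketch of 8-bit scalar quantization. It assumes a uniform per-dimension quantizer with 256 levels, which is similar in spirit to `QT_8bit` but not FAISS's exact encoding. The data is synthetic, chosen only to mimic the integer value range mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
# integer-valued float32 vectors in [0, 218], mimicking the SIFT value range
x = rng.integers(0, 219, size=(1000, 8)).astype(np.float32)

vmin = x.min(axis=0)
vmax = x.max(axis=0)
step = (vmax - vmin) / 255.0                            # 256 levels per dimension

codes = np.round((x - vmin) / step).astype(np.uint8)    # 1 byte per component
x_hat = codes.astype(np.float32) * step + vmin          # decode back to float32

max_err = float(np.abs(x - x_hat).max())
print(f"max abs reconstruction error: {max_err:.3f}")
print(f"bytes per vector: {codes.shape[1]} vs {x.shape[1] * 4} for float32")
```

The error per component is bounded by half a quantization step, which is small compared with values that range up to 218, while the storage drops to a quarter of the float32 size.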

## 6. Product Quantizer Index


```python
%%time
M = 16
nbits = 8
metric = faiss.METRIC_L2

index = faiss.IndexPQ(d, M, nbits, metric)

index.train(corpus)
index.add(corpus)
```


```python
%%time
D, I = index.search(query, k)
```


```python
recall = compute_recall(I, I_truth)
print(f"Recall: {recall}")
```

The product quantizer index does not stand out in any single aspect, but it balances the trade-offs reasonably well. It is widely used in real applications in combination with other indexes such as IVF or HNSW.
