Summary: This tutorial demonstrates implementing fast similarity search using Faiss indexing with FlagEmbedding. It provides implementation knowledge for creating, configuring, and using Faiss indices with vector embeddings, including CPU/GPU setup, data type handling, and index persistence. The tutorial helps with tasks like building efficient vector search systems, managing large-scale embeddings, and performing similarity queries. Key features covered include embedding generation using FlagModel, Faiss index creation and configuration, GPU support implementation, index saving/loading functionality, and performing vector similarity searches with customizable result counts.

# Indexing Using Faiss

In practical cases, datasets contain thousands or millions of rows. Looping through the whole corpus to find the best answer to a query is very time and space consuming. In this tutorial, we'll introduce how to use indexing to make our retrieval fast and neat.

## Step 0: Setup

Install the dependencies in the environment.


```python
%pip install -U FlagEmbedding
```

### faiss-gpu on Linux (x86_64)

Faiss maintain the latest updates on conda. So if you have GPUs on Linux x86_64, create a conda virtual environment and run:

```conda install -c pytorch -c nvidia faiss-gpu=1.8.0```

and make sure you select that conda env as the kernel for this notebook.

### faiss-cpu

Otherwise it's simple, just run the following cell to install `faiss-cpu`


```python
%pip install -U faiss-cpu
```

## Step 1: Dataset

Below is a super tiny courpus with only 10 sentences, which will be the dataset we use.

Each sentence is a concise discription of a famous people in specific domain.


```python
corpus = [
    "Michael Jackson was a legendary pop icon known for his record-breaking music and dance innovations.",
    "Fei-Fei Li is a professor in Stanford University, revolutionized computer vision with the ImageNet project.",
    "Brad Pitt is a versatile actor and producer known for his roles in films like 'Fight Club' and 'Once Upon a Time in Hollywood.'",
    "Geoffrey Hinton, as a foundational figure in AI, received Turing Award for his contribution in deep learning.",
    "Eminem is a renowned rapper and one of the best-selling music artists of all time.",
    "Taylor Swift is a Grammy-winning singer-songwriter known for her narrative-driven music.",
    "Sam Altman leads OpenAI as its CEO, with astonishing works of GPT series and pursuing safe and beneficial AI.",
    "Morgan Freeman is an acclaimed actor famous for his distinctive voice and diverse roles.",
    "Andrew Ng spread AI knowledge globally via public courses on Coursera and Stanford University.",
    "Robert Downey Jr. is an iconic actor best known for playing Iron Man in the Marvel Cinematic Universe.",
]
```

And a few queries (add your own queries and check the result!): 


```python
queries = [
    "Who is Robert Downey Jr.?",
    "An expert of neural network",
    "A famous female singer",
]
```

## Step 2: Text Embedding

Here, for the sake of speed, we just embed the first 500 docs in the corpus.


```python
from FlagEmbedding import FlagModel

# get the BGE embedding model
model = FlagModel('BAAI/bge-base-en-v1.5',
                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
                  use_fp16=True)

# get the embedding of the corpus
corpus_embeddings = model.encode(corpus)

print("shape of the corpus embeddings:", corpus_embeddings.shape)
print("data type of the embeddings: ", corpus_embeddings.dtype)
```

Faiss only accepts float32 inputs.

So make sure the dtype of corpus_embeddings is float32 before adding them to the index.


```python
import numpy as np

corpus_embeddings = corpus_embeddings.astype(np.float32)
```

## Step 3: Indexing

In this step, we build an index and add the embedding vectors to it.


```python
import faiss

# get the length of our embedding vectors, vectors by bge-base-en-v1.5 have length 768
dim = corpus_embeddings.shape[-1]

# create the faiss index and store the corpus embeddings into the vector space
index = faiss.index_factory(dim, 'Flat', faiss.METRIC_INNER_PRODUCT)

# if you installed faiss-gpu, uncomment the following lines to make the index on your GPUs.

# co = faiss.GpuMultipleClonerOptions()
# index = faiss.index_cpu_to_all_gpus(index, co)
```

No need to train if we use "Flat" quantizer and METRIC_INNER_PRODUCT as metric. Some other indices that using quantization might need training.


```python
# check if the index is trained
print(index.is_trained)  
# index.train(corpus_embeddings)

# add all the vectors to the index
index.add(corpus_embeddings)

print(f"total number of vectors: {index.ntotal}")
```

### Step 3.5 (Optional): Saving Faiss index

Once you have your index with the embedding vectors, you can save it locally for future usage.


```python
# change the path to where you want to save the index
path = "./index.bin"
faiss.write_index(index, path)
```

If you already have stored index in your local directory, you can load it by:


```python
index = faiss.read_index("./index.bin")
```

## Step 4: Find answers to the query

First, get the embeddings of all the queries:


```python
query_embeddings = model.encode_queries(queries)
```

Then, use the Faiss index to do a knn search in the vector space:


```python
dists, ids = index.search(query_embeddings, k=3)
print(dists)
print(ids)
```

Now let's see the result:


```python
for i, q in enumerate(queries):
    print(f"query:\t{q}\nanswer:\t{corpus[ids[i][0]]}\n")
```
