Summary: This tutorial demonstrates the implementation of a neural search and evaluation system using FlagEmbedding and Faiss. It covers techniques for efficient similarity search implementation, batch processing of embeddings, and model evaluation workflows. The code can help with tasks like building search systems, evaluating search quality using metrics like MRR, and comparing model performances. Key features include data loading from JSON files, vector similarity search with Faiss, batch processing for memory efficiency, calculation of evaluation metrics, and comparison between base and fine-tuned models. The tutorial is particularly useful for implementing and evaluating neural search systems with both raw and fine-tuned embedding models.

# Evaluate the Fine-tuned Model

In the previous sections, we prepared the dataset and fine-tuned the model. In this tutorial, we will go through how to evaluate the model with the test dataset we constructed.

## 0. Installation


```python
% pip install -U datasets pytrec_eval FlagEmbedding
```

## 1. Load Data

We first load data from the files we processed.


```python
from datasets import load_dataset

queries = load_dataset("json", data_files="ft_data/test_queries.jsonl")["train"]
corpus = load_dataset("json", data_files="ft_data/corpus.jsonl")["train"]
qrels = load_dataset("json", data_files="ft_data/test_qrels.jsonl")["train"]

queries_text = queries["text"]
corpus_text = [text for sub in corpus["text"] for text in sub]
```


```python
qrels_dict = {}
for line in qrels:
    if line['qid'] not in qrels_dict:
        qrels_dict[line['qid']] = {}
    qrels_dict[line['qid']][line['docid']] = line['relevance']
```

## 2. Search

Then we prepare a function to encode the text into embeddings and search the results:


```python
import faiss
import numpy as np
from tqdm import tqdm


def search(model, queries_text, corpus_text):
    
    queries_embeddings = model.encode_queries(queries_text)
    corpus_embeddings = model.encode_corpus(corpus_text)
    
    # create and store the embeddings in a Faiss index
    dim = corpus_embeddings.shape[-1]
    index = faiss.index_factory(dim, 'Flat', faiss.METRIC_INNER_PRODUCT)
    corpus_embeddings = corpus_embeddings.astype(np.float32)
    index.train(corpus_embeddings)
    index.add(corpus_embeddings)
    
    query_size = len(queries_embeddings)

    all_scores = []
    all_indices = []

    # search top 100 answers for all the queries
    for i in tqdm(range(0, query_size, 32), desc="Searching"):
        j = min(i + 32, query_size)
        query_embedding = queries_embeddings[i: j]
        score, indice = index.search(query_embedding.astype(np.float32), k=100)
        all_scores.append(score)
        all_indices.append(indice)

    all_scores = np.concatenate(all_scores, axis=0)
    all_indices = np.concatenate(all_indices, axis=0)
    
    # store the results into the format for evaluation
    results = {}
    for idx, (scores, indices) in enumerate(zip(all_scores, all_indices)):
        results[queries["id"][idx]] = {}
        for score, index in zip(scores, indices):
            if index != -1:
                results[queries["id"][idx]][corpus["id"][index]] = float(score)
                
    return results
```

## 3. Evaluation


```python
from FlagEmbedding.abc.evaluation.utils import evaluate_metrics, evaluate_mrr
from FlagEmbedding import FlagModel

k_values = [10,100]

raw_name = "BAAI/bge-large-en-v1.5"
finetuned_path = "test_encoder_only_base_bge-large-en-v1.5"
```

The result for the original model:


```python
raw_model = FlagModel(
    raw_name, 
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
    devices=[0],
    use_fp16=False
)

results = search(raw_model, queries_text, corpus_text)

eval_res = evaluate_metrics(qrels_dict, results, k_values)
mrr = evaluate_mrr(qrels_dict, results, k_values)

for res in eval_res:
    print(res)
print(mrr)
```

Then the result for the model after fine-tuning:


```python
ft_model = FlagModel(
    finetuned_path, 
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
    devices=[0],
    use_fp16=False
)

results = search(ft_model, queries_text, corpus_text)

eval_res = evaluate_metrics(qrels_dict, results, k_values)
mrr = evaluate_mrr(qrels_dict, results, k_values)

for res in eval_res:
    print(res)
print(mrr)
```

We can see an obvious improvement in all the metrics.
