# Condensed: Evaluate the Fine-tuned Model

Summary: This tutorial demonstrates the implementation of a neural search and evaluation system using FlagEmbedding and Faiss. It covers techniques for efficient similarity search implementation, batch processing of embeddings, and model evaluation workflows. The code can help with tasks like building search systems, evaluating search quality using metrics like MRR, and comparing model performances. Key features include data loading from JSON files, vector similarity search with Faiss, batch processing for memory efficiency, calculation of evaluation metrics, and comparison between base and fine-tuned models. The tutorial is particularly useful for implementing and evaluating neural search systems with both raw and fine-tuned embedding models.

*This is a condensed version that preserves essential implementation details and context.*

Here's the condensed tutorial focusing on essential implementation details:

# Model Evaluation Tutorial

## Key Dependencies
```python
pip install -U datasets pytrec_eval FlagEmbedding
```

## Implementation Steps

### 1. Data Loading
```python
from datasets import load_dataset

# Load test data
queries = load_dataset("json", data_files="ft_data/test_queries.jsonl")["train"]
corpus = load_dataset("json", data_files="ft_data/corpus.jsonl")["train"]
qrels = load_dataset("json", data_files="ft_data/test_qrels.jsonl")["train"]

# Extract text
queries_text = queries["text"]
corpus_text = [text for sub in corpus["text"] for text in sub]

# Create relevance dictionary
qrels_dict = {}
for line in qrels:
    if line['qid'] not in qrels_dict:
        qrels_dict[line['qid']] = {}
    qrels_dict[line['qid']][line['docid']] = line['relevance']
```

### 2. Search Implementation
```python
import faiss
import numpy as np
from tqdm import tqdm

def search(model, queries_text, corpus_text):
    # Encode queries and corpus
    queries_embeddings = model.encode_queries(queries_text)
    corpus_embeddings = model.encode_corpus(corpus_text)
    
    # Initialize Faiss index
    dim = corpus_embeddings.shape[-1]
    index = faiss.index_factory(dim, 'Flat', faiss.METRIC_INNER_PRODUCT)
    corpus_embeddings = corpus_embeddings.astype(np.float32)
    index.train(corpus_embeddings)
    index.add(corpus_embeddings)
    
    # Batch search (32 queries at a time)
    query_size = len(queries_embeddings)
    all_scores, all_indices = [], []
    
    for i in tqdm(range(0, query_size, 32), desc="Searching"):
        j = min(i + 32, query_size)
        query_embedding = queries_embeddings[i: j]
        score, indice = index.search(query_embedding.astype(np.float32), k=100)
        all_scores.append(score)
        all_indices.append(indice)

    # Format results
    results = {}
    for idx, (scores, indices) in enumerate(zip(np.concatenate(all_scores, axis=0), 
                                              np.concatenate(all_indices, axis=0))):
        results[queries["id"][idx]] = {
            corpus["id"][index]: float(score) 
            for score, index in zip(scores, indices) if index != -1
        }
    return results
```

### 3. Model Evaluation
```python
from FlagEmbedding.abc.evaluation.utils import evaluate_metrics, evaluate_mrr
from FlagEmbedding import FlagModel

# Configuration
k_values = [10,100]
raw_name = "BAAI/bge-large-en-v1.5"
finetuned_path = "test_encoder_only_base_bge-large-en-v1.5"

# Model initialization parameters
model_params = {
    "query_instruction_for_retrieval": "Represent this sentence for searching relevant passages:",
    "devices": [0],
    "use_fp16": False
}

# Evaluate both models
for model_path in [raw_name, finetuned_path]:
    model = FlagModel(model_path, **model_params)
    results = search(model, queries_text, corpus_text)
    eval_res = evaluate_metrics(qrels_dict, results, k_values)
    mrr = evaluate_mrr(qrels_dict, results, k_values)
```

## Important Notes
- Uses Faiss for efficient similarity search
- Processes queries in batches of 32 for memory efficiency
- Evaluates top 100 results for each query
- Compares metrics between original and fine-tuned models
- Uses inner product metric for similarity calculation