# Condensed: Evaluation

Summary: This tutorial shows how to build and evaluate a text retrieval pipeline with FlagEmbedding and FAISS. It covers generating embeddings with a BGE model (using FP16 inference), building a FAISS index for similarity search, searching queries in batches, and scoring the results with Recall, MRR, and nDCG at configurable cut-off points.

*This is a condensed version that preserves essential implementation details and context.*

# Embedding Model Evaluation Pipeline

## Key Setup & Dependencies
```python
%pip install -U FlagEmbedding faiss-cpu
```

## Implementation Steps

### 1. Data Loading
```python
from datasets import load_dataset
import numpy as np

# Load the dev split and keep a small subset for the demo
data = load_dataset("namespace-Pt/msmarco", split="dev")
queries = np.array(data[:100]["query"])      # first 100 queries
corpus = sum(data[:5000]["positive"], [])    # flatten the positives of the first 5,000 rows into one list
```

### 2. Embedding Generation
```python
from FlagEmbedding import FlagModel

model = FlagModel('BAAI/bge-base-en-v1.5',
                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
                  use_fp16=True)

corpus_embeddings = model.encode(corpus)
```

### 3. FAISS Index Creation
```python
import faiss

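dim = corpus_embeddings.shape[-1]
# 'Flat' performs exhaustive (exact) search; inner product is the similarity metric
index = faiss.index_factory(dim, 'Flat', faiss.METRIC_INNER_PRODUCT)
corpus_embeddings = corpus_embeddings.astype(np.float32)  # FAISS expects float32
# A Flat index needs no training; train() is kept so the code also works for index types that do
index.train(corpus_embeddings)
index.add(corpus_embeddings)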

# Optional: Save/Load index
# faiss.write_index(index, "index.bin")
# index = faiss.read_index("index.bin")
```
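
To make the role of the `Flat` inner-product index concrete, the sketch below reproduces the same kind of result (top-k scores and corpus indices per query) with plain NumPy. It is only an illustration of exhaustive inner-product search, not how FAISS is implemented; `flat_ip_search` and the toy vectors are hypothetical names and data introduced here.

```python
import numpy as np

def flat_ip_search(corpus_vecs, query_vecs, k):
    """Exhaustive inner-product search: returns (scores, indices), best match first."""
    scores = query_vecs @ corpus_vecs.T                       # shape (n_queries, n_corpus)
    top_idx = np.argsort(-scores, axis=1)[:, :k]              # indices of the k largest scores
    top_scores = np.take_along_axis(scores, top_idx, axis=1)  # scores in the same order
    return top_scores, top_idx

# Toy unit-length vectors standing in for real embeddings (assumption for illustration)
corpus_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype=np.float32)
query_vecs = np.array([[0.8, 0.6]], dtype=np.float32)

toy_scores, toy_idx = flat_ip_search(corpus_vecs, query_vecs, k=2)
print(toy_idx)  # corpus row 2 ranks first (score 0.96), then row 0 (score 0.8)
```

When the vectors are unit length, as in this toy data, the inner product equals cosine similarity, so ranking by either gives the same order.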

### 4. Retrieval Implementation
```python
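query_embeddings = model.encode_queries(queries)
ground_truths = [d["positive"] for d in data]
corpus = np.asarray(corpus)  # array form allows indexing with the ids FAISS returns

# Search parameters
cut_offs = [1, 10]
k = max(cut_offs)  # retrieve enough results for the largest cut-off
batch_size = 256

# Accumulate results across batches for the evaluation step
res_scores, res_ids, res_text = [], [], []

# Batch processing for search
for i in range(0, len(query_embeddings), batch_size):
    q_embedding = query_embeddings[i:i + batch_size].astype(np.float32)
    score, idx = index.search(q_embedding, k=k)  # top-k scores and corpus ids per query
    res_scores += list(score)
    res_ids += list(idx)
    res_text += list(corpus[idx])  # map ids back to passage text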
```

### 5. Evaluation Metrics

#### Recall Implementation
```python
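def calc_recall(preds, truths, cutoffs):
    """Average recall over all queries at each cut-off."""
    recalls = np.zeros(len(cutoffs))
    for text, truth in zip(preds, truths):
        for i, c in enumerate(cutoffs):
            # Relevant passages that appear in the top-c predictions
            recall = np.intersect1d(truth, text[:c])
            # At most min(c, len(truth)) hits are possible; max(..., 1) avoids division by zero
            recalls[i] += len(recall) / max(min(c, len(truth)), 1)
    recalls /= len(preds)
    return recalls

# res_text holds the retrieved passages per query from the retrieval step
recalls = calc_recall(res_text, ground_truths, cut_offs)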
```
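
The denominator in `calc_recall` is `min(c, len(truth))` rather than the usual `len(truth)`, so a query with more relevant passages than the cut-off can still reach a recall of 1. A minimal single-query sketch of that arithmetic in pure Python follows; `recall_at_k` and the passage ids are hypothetical and exist only for this illustration.

```python
def recall_at_k(pred, truth, k):
    """Share of the reachable relevant passages found in the top-k predictions."""
    hits = len(set(pred[:k]) & set(truth))
    # Same denominator as calc_recall: at most min(k, len(truth)) hits are possible
    return hits / max(min(k, len(truth)), 1)

pred = ["p3", "p7", "p1", "p9"]   # ranked retrieval results (toy data)
truth = ["p1", "p7"]              # relevant passages for the query (toy data)

r1 = recall_at_k(pred, truth, 1)  # top-1 is "p3": 0 hits / min(1, 2)
r2 = recall_at_k(pred, truth, 2)  # top-2 contains "p7": 1 hit / min(2, 2)
r3 = recall_at_k(pred, truth, 3)  # top-3 contains "p7" and "p1": 2 hits / min(3, 2)
print(r1, r2, r3)
```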

#### MRR Implementation
```python
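def MRR(preds, truth, cutoffs):
    """Mean reciprocal rank of the first relevant prediction at each cut-off."""
    mrr = [0 for _ in range(len(cutoffs))]
    for pred, t in zip(preds, truth):
        for i, c in enumerate(cutoffs):
            for j, p in enumerate(pred):
                if j < c and p in t:
                    mrr[i] += 1/(j+1)  # reciprocal of the 1-based rank
                    break  # only the first relevant hit counts
    mrr = [total/len(preds) for total in mrr]
    return mrr

# res_text holds the retrieved passages per query from the retrieval step
mrr = MRR(res_text, ground_truths, cut_offs)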
```
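
As a worked example of the metric, the sketch below computes the reciprocal rank per query on toy data and then averages it. `reciprocal_rank` and the letters used as passage ids are hypothetical; the logic is meant to mirror the `MRR` function above for a single cut-off.

```python
def reciprocal_rank(pred, truth, k):
    """1/rank of the first relevant item within the top-k, or 0.0 if none appears."""
    for rank, p in enumerate(pred[:k], start=1):
        if p in truth:
            return 1.0 / rank
    return 0.0

# Toy rankings for three queries and their relevant passages
preds = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
truths = [["a"], ["f"], ["z"]]

# Query 1 hits at rank 1, query 2 at rank 3, query 3 never hits
rr_at_3 = [reciprocal_rank(p, t, 3) for p, t in zip(preds, truths)]
mrr_at_3 = sum(rr_at_3) / len(rr_at_3)  # (1 + 1/3 + 0) / 3
mrr_at_1 = sum(reciprocal_rank(p, t, 1) for p, t in zip(preds, truths)) / len(preds)
print(mrr_at_1, mrr_at_3)
```

With a cut-off of 1 only the first query contributes, which is why MRR@1 is lower than MRR@3 here.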

#### nDCG Calculation
```python
from sklearn.metrics import ndcg_score

# Binary relevance of each retrieved passage (1 if it is in the query's ground truth)
pred_hard_encodings = [list(np.isin(pred, label).astype(int))
                      for pred, label in zip(res_text, ground_truths)]

# Calculate nDCG for each cutoff
for c in cut_offs:
    nDCG = ndcg_score(pred_hard_encodings, res_scores, k=c)
    print(f"nDCG@{c}: {nDCG}")
```
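
To show what the score measures, here is a minimal pure-Python sketch of nDCG with binary relevance and a log2 position discount. `dcg_at_k` and `ndcg_at_k` are hypothetical helpers, and the sketch assumes the passages are already sorted by score with no ties; under that assumption it should agree with `ndcg_score`, but it is not the library's implementation.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: each relevance is divided by log2(position + 1)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Binary relevance of four retrieved passages, in ranked order (toy data)
ranked_relevance = [0, 1, 0, 1]

ndcg_1 = ndcg_at_k(ranked_relevance, 1)  # top-1 is irrelevant, so the score is 0
ndcg_4 = ndcg_at_k(ranked_relevance, 4)  # (1/log2(3) + 1/log2(5)) / (1 + 1/log2(3))
print(ndcg_1, round(ndcg_4, 4))
```

Because the ideal ranking is built only from the retrieved passages, a query whose relevant passages were not retrieved at all scores 0 rather than being penalised further.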

## Important Notes
- Pass `use_fp16=True` to run the model in half precision, which speeds up encoding
- Search in batches (`batch_size=256` here) when there are many queries
- Save the FAISS index with `faiss.write_index` to avoid rebuilding it on every run
- Metrics are computed at each cut-off in `cut_offs` (here 1 and 10)
- Cast embeddings to `np.float32` before passing them to FAISS