# Condensed: Evaluate on MLDR

Summary: This tutorial shows how to implement and evaluate dense retrieval on the MLDR dataset with FlagEmbedding. It covers generating text embeddings with `BAAI/bge-base-en-v1.5`, similarity search with a FAISS index, and evaluation with pytrec_eval. The steps are dataset loading, embedding generation, batched vector search, and computing standard IR metrics (NDCG and Recall at cutoffs 10 and 100). It also shows a shorter alternative that uses FlagEmbedding's built-in evaluation tools, so it serves both detailed custom implementations and quick evaluations.

*This is a condensed version that preserves essential implementation details and context.*

# MLDR Evaluation Implementation Guide

## Key Setup
```bash
# faiss-cpu provides the `faiss` module used in the indexing step
pip install FlagEmbedding pytrec_eval faiss-cpu
```

## Core Implementation Steps

### 1. Load Dataset
```python
from datasets import load_dataset

# Load queries and corpus
lang = "en"
dataset = load_dataset('Shitao/MLDR', lang, trust_remote_code=True)
corpus = load_dataset('Shitao/MLDR', f"corpus-{lang}", trust_remote_code=True)['corpus']

# Prepare IDs and text
corpus_ids = corpus['docid']
corpus_text = corpus['text']
queries_ids = dataset['dev']['query_id']
queries_text = dataset['dev']['query']
```
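The evaluation step later needs relevance judgments (qrels) keyed by query ID. A minimal sketch of building them from the dev split follows, assuming each record carries `positive_passages` and `negative_passages` lists whose items have a `docid` field; those field names are an assumption about the dataset schema and should be checked against the loaded data. The helper name `build_qrels` and the toy records are hypothetical.

```python
def build_qrels(records):
    """Map each query ID to {docid: relevance}: 1 for positives, 0 for negatives."""
    qrels = {}
    for record in records:
        qid = str(record["query_id"])
        judged = qrels.setdefault(qid, {})
        # Assumed schema: lists of passages, each with a "docid" field
        for doc in record.get("positive_passages", []):
            judged[str(doc["docid"])] = 1
        for doc in record.get("negative_passages", []):
            judged[str(doc["docid"])] = 0
    return qrels


# Toy records standing in for dataset['dev']
toy_records = [
    {
        "query_id": "q1",
        "positive_passages": [{"docid": "d1"}],
        "negative_passages": [{"docid": "d2"}],
    },
]
toy_qrels = build_qrels(toy_records)
print(toy_qrels)  # {'q1': {'d1': 1, 'd2': 0}}
```

With the real data, `qrels_dict = build_qrels(dataset['dev'])` would produce the nested dictionary that the evaluation step expects, provided the schema assumption holds.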

### 2. Generate Embeddings
```python
from FlagEmbedding import FlagModel

model = FlagModel('BAAI/bge-base-en-v1.5')
queries_embeddings = model.encode_queries(queries_text)
corpus_embeddings = model.encode_corpus(corpus_text)
```

### 3. Create FAISS Index
```python
import faiss
import numpy as np

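dim = corpus_embeddings.shape[-1]
# 'Flat' builds an exact (brute-force) index scored by inner product
index = faiss.index_factory(dim, 'Flat', faiss.METRIC_INNER_PRODUCT)
# FAISS expects float32 input
corpus_embeddings = corpus_embeddings.astype(np.float32)
# train() does nothing for a Flat index; it matters only for index types that need training
index.train(corpus_embeddings)
index.add(corpus_embeddings)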
```
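Conceptually, a `Flat` index with the inner-product metric scores every corpus vector against the query and keeps the highest-scoring ones. The pure-Python sketch below illustrates that computation on toy vectors; it is not how FAISS is implemented, and the function name `top_k_inner_product` is hypothetical.

```python
def top_k_inner_product(query, corpus, k):
    """Return (scores, indices) of the k corpus vectors with the largest inner product."""
    scored = [
        (sum(q * c for q, c in zip(query, vector)), idx)
        for idx, vector in enumerate(corpus)
    ]
    # Sort by score descending; ties are broken by the lower index
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    top = scored[:k]
    return [score for score, _ in top], [idx for _, idx in top]


# Toy 2-dimensional "embeddings"
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
toy_scores, toy_indices = top_k_inner_product([0.0, 1.0], toy_corpus, k=2)
print(toy_indices)  # [1, 2]
```

The returned pair mirrors the `(scores, indices)` shape of a single query row from `index.search`.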

### 4. Search Implementation
```python
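# Batch search through queries
query_size = len(queries_embeddings)
all_scores, all_indices = [], []

for i in range(0, query_size, 32):
    j = min(i + 32, query_size)
    query_embedding = queries_embeddings[i: j]
    score, indice = index.search(query_embedding.astype(np.float32), k=100)
    # Store results
    all_scores.append(score)
    all_indices.append(indice)

# Merge the per-batch results into one row per query
all_scores = np.concatenate(all_scores, axis=0)
all_indices = np.concatenate(all_indices, axis=0)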
```
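Before evaluation, the search output has to be turned into a mapping from query ID to retrieved document IDs and scores. The sketch below shows one way to do that, assuming the per-batch outputs have been merged so that there is one row of scores and one row of indices per query. The helper name `build_results` and the toy inputs are hypothetical.

```python
def build_results(query_ids, doc_ids, score_rows, index_rows):
    """Map each query ID to {docid: score} for its retrieved documents."""
    results = {}
    for qid, scores, indices in zip(query_ids, score_rows, index_rows):
        retrieved = {}
        for score, doc_index in zip(scores, indices):
            # FAISS pads with -1 when fewer than k neighbours are found
            if doc_index != -1:
                retrieved[str(doc_ids[int(doc_index)])] = float(score)
        results[str(qid)] = retrieved
    return results


# Toy inputs standing in for queries_ids, corpus_ids, all_scores, all_indices
toy_results = build_results(
    ["q1"], ["d1", "d2", "d3"], [[0.9, 0.4, 0.0]], [[2, 0, -1]]
)
print(toy_results)  # {'q1': {'d3': 0.9, 'd1': 0.4}}
```

In the pipeline, the call would look like `results = build_results(queries_ids, corpus_ids, all_scores, all_indices)`. Scores are cast to plain `float` so the dictionary holds native Python numbers.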

### 5. Evaluation
```python
import pytrec_eval

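# Configure evaluation metrics: NDCG and Recall at cutoffs 10 and 100
k_values = [10, 100]
ndcg_string = "ndcg_cut." + ",".join(str(k) for k in k_values)
recall_string = "recall." + ",".join(str(k) for k in k_values)

# qrels_dict: {query_id: {docid: relevance}}, built from the dataset's relevance labels
# results: {query_id: {docid: score}}, built from the search output
evaluator = pytrec_eval.RelevanceEvaluator(
    qrels_dict, {ndcg_string, recall_string}
)
# Returns per-query scores: {query_id: {measure: value}}
scores = evaluator.evaluate(results)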
```
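`evaluator.evaluate` returns one score dictionary per query, so a single number per metric requires averaging over queries. A minimal sketch follows, assuming the nested `{query_id: {measure: value}}` layout with measure keys such as `ndcg_cut_10`; the helper name `mean_scores` and the toy values are illustrative only.

```python
def mean_scores(per_query_scores):
    """Average each measure over all queries."""
    if not per_query_scores:
        return {}
    totals = {}
    for measures in per_query_scores.values():
        for name, value in measures.items():
            totals[name] = totals.get(name, 0.0) + value
    n_queries = len(per_query_scores)
    return {name: total / n_queries for name, total in totals.items()}


# Toy per-query scores standing in for the evaluator's output
toy_per_query = {
    "q1": {"ndcg_cut_10": 1.0, "recall_10": 1.0},
    "q2": {"ndcg_cut_10": 0.5, "recall_10": 0.0},
}
toy_means = mean_scores(toy_per_query)
print(toy_means)  # {'ndcg_cut_10': 0.75, 'recall_10': 0.5}
```

Applied to the real output, `mean_scores(scores)` would give the dataset-level NDCG and Recall values.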

## Important Configurations

- Model: `bge-base-en-v1.5`
- Batch size: 32 for search
- Top-k retrieval: 100
- Evaluation metrics: NDCG@10, NDCG@100, Recall@10, Recall@100

## Best Practices

1. Use float32 for embeddings when working with FAISS
2. Process queries in batches for efficient searching
3. Cache embeddings when possible for large datasets
4. Use the inner product metric for similarity search; on L2-normalized embeddings it is equivalent to cosine similarity
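
The caching advice above can be sketched with a small load-or-compute helper. This is one possible approach using only the standard library, not something the tutorial prescribes; the helper name `load_or_compute`, the file name, and the stand-in encoder `fake_encode` are all hypothetical.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute_fn):
    """Return the cached value at `path`, or compute it, save it, and return it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    value = compute_fn()
    with open(path, "wb") as f:
        pickle.dump(value, f)
    return value


calls = []


def fake_encode():
    # Stand-in for an expensive call such as model.encode_corpus(corpus_text)
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "corpus_embeddings.pkl")
    first = load_or_compute(cache_path, fake_encode)   # computes and writes the cache
    second = load_or_compute(cache_path, fake_encode)  # reads the cache, no recompute
```

In the pipeline, `compute_fn` could be `lambda: model.encode_corpus(corpus_text)`. For NumPy arrays, `np.save` and `np.load` are a common alternative to pickle.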

## Alternative Quick Evaluation

Use FlagEmbedding's built-in evaluation:

```python
from FlagEmbedding.evaluation.mldr import (
    MLDREvalArgs,
    MLDREvalModelArgs,
    MLDREvalRunner,
)

eval_args = MLDREvalArgs(
    eval_name="mldr",
    dataset_names=["en"],
    splits=["dev"],
    k_values=[10, 100]
)
model_args = MLDREvalModelArgs(
    embedder_name_or_path="BAAI/bge-base-en-v1.5",
    embedder_batch_size=1024
)

runner = MLDREvalRunner(eval_args, model_args)
runner.run()
```