# Condensed: Evaluate on BEIR

Summary: This tutorial demonstrates two ways to evaluate an embedding model on BEIR (Benchmarking IR): BEIR's native evaluation pipeline and the FlagEmbedding evaluation framework. It covers logging setup, downloading and loading a BEIR dataset, initializing a dense retrieval model, configuring evaluation parameters, and computing NDCG, MAP, recall, and precision. It is most useful for information retrieval evaluation, dense passage retrieval, and benchmarking embedding models.

# BEIR Evaluation Guide

## Key Components
1. BEIR installation and setup
2. Direct BEIR evaluation
3. FlagEmbedding evaluation method

## Installation
```bash
pip install beir FlagEmbedding
```

## 1. Direct BEIR Evaluation

### Setup Logging
```python
import logging
from beir import LoggingHandler

logging.basicConfig(format='%(message)s',
                   level=logging.INFO,
                   handlers=[LoggingHandler()])
```

### Load Dataset
```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader

# Download and unzip the dataset; data_path points at the extracted folder
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/arguana.zip"
data_path = util.download_and_unzip(url, "data")

# Load corpus, queries, and relevance judgments from the extracted folder
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
```
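
The three objects returned by `load` are plain dictionaries. The sketch below uses invented IDs and texts (not real ArguAna records) to illustrate the shapes that the retrieval and evaluation steps expect. The `check_qrels` helper is hypothetical and not part of BEIR:

```python
# Toy stand-ins for the objects returned by GenericDataLoader.load().
# All IDs and texts are invented for illustration.

# corpus: doc_id -> {"title": ..., "text": ...}
corpus = {
    "doc1": {"title": "Solar power", "text": "Solar panels convert sunlight into electricity."},
    "doc2": {"title": "Wind power", "text": "Wind turbines convert moving air into electricity."},
}

# queries: query_id -> query text
queries = {"q1": "How do solar panels work?"}

# qrels: query_id -> {doc_id: relevance grade}
qrels = {"q1": {"doc1": 1}}


def check_qrels(corpus, queries, qrels):
    """Return (query_id, doc_id) pairs in qrels that reference unknown IDs."""
    missing = []
    for qid, judged in qrels.items():
        for doc_id in judged:
            if qid not in queries or doc_id not in corpus:
                missing.append((qid, doc_id))
    return missing


print(len(corpus), len(queries), check_qrels(corpus, queries, qrels))
```

A quick consistency check like this can catch a mismatched split or a truncated download before any embeddings are computed.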

### Run Evaluation
```python
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval import models
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Initialize model and retriever
model = DRES(models.SentenceBERT("BAAI/bge-base-en-v1.5"), batch_size=128)
retriever = EvaluateRetrieval(model, score_function="cos_sim")

# Retrieve and evaluate
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
```

## 2. FlagEmbedding Evaluation

### Configuration
```python
eval_args = {
    "eval_name": "beir",
    "dataset_dir": "./beir/data",
    "dataset_names": ["arguana"],
    "splits": ["test", "dev"],
    "corpus_embd_save_dir": "./beir/corpus_embd",
    "output_dir": "./beir/search_results",
    "search_top_k": 1000,
    "rerank_top_k": 100,
    "k_values": [10, 100],
    "eval_metrics": ["ndcg_at_10", "recall_at_100"],
    "embedder_name_or_path": "BAAI/bge-base-en-v1.5",
    "embedder_batch_size": 1024
}
```

### Run Evaluation
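```python
from transformers import HfArgumentParser
from FlagEmbedding.evaluation.beir import BEIREvalArgs, BEIREvalModelArgs, BEIREvalRunner

# Build the two argument dataclasses from the `eval_args` dict defined above.
# parse_dict routes each key to the dataclass that declares it; calling
# parse_args_into_dataclasses() with no arguments would read sys.argv
# and ignore the dict entirely.
parser = HfArgumentParser((BEIREvalArgs, BEIREvalModelArgs))
beir_eval_args, beir_model_args = parser.parse_dict(eval_args)

runner = BEIREvalRunner(eval_args=beir_eval_args, model_args=beir_model_args)
runner.run()
```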
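
`HfArgumentParser.parse_args_into_dataclasses` accepts an explicit `args` list of command-line style tokens and falls back to `sys.argv` when none is given. The following is a minimal sketch of flattening a configuration dictionary into such a token list; `config_to_argv` is a hypothetical helper, not part of FlagEmbedding or transformers, and the example dictionary is a shortened stand-in for the full configuration:

```python
def config_to_argv(config):
    """Flatten a config dict into command-line style tokens.

    List and tuple values become several tokens after a single flag,
    which matches how argparse handles multi-value options.
    """
    argv = []
    for key, value in config.items():
        argv.append(f"--{key}")
        if isinstance(value, (list, tuple)):
            argv.extend(str(v) for v in value)
        else:
            argv.append(str(value))
    return argv


# Shortened example config (illustrative subset of the settings)
example = {"eval_name": "beir", "splits": ["test", "dev"], "search_top_k": 1000}
print(config_to_argv(example))
```

The resulting list could then be passed as `parser.parse_args_into_dataclasses(args=config_to_argv(...))`, or joined with spaces to reproduce the same settings on a shell command line.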

## Important Notes
- BEIR contains 18 datasets (14 public, 4 private)
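- Choose `batch_size` (BEIR) and `embedder_batch_size` (FlagEmbedding) to fit available memory; reduce them if you hit out-of-memory errors
- FlagEmbedding writes search results as JSON under the configured `output_dir` (`./beir/search_results` in the configuration above)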
- Private datasets require special licenses
- Evaluation metrics include NDCG, MAP, Recall, and Precision
