## Overview

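This module provides a baseline for our experiments: token embeddings are discretized into cluster IDs, and the discretized corpus is indexed and searched with BM25. The pipeline has five steps: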

1. Generating embeddings from the text corpus using a pretrained model and 
   saving them in a datastore.
2. Training a clustering model on the token embeddings in the datastore.
3. Mapping new text data to cluster centroids using the trained clustering model and the pretrained embedding model.
4. Building a BM25 index using the mapped (preprocessed) corpus (JSON files).
5. Retrieving documents using the BM25 index and scoring the results.

Each step is implemented in a separate script.

The following sections provide detailed instructions on how to run each script and what parameters to use.

Note: I had to pin NumPy to 1.24.0 to avoid a dependency conflict with the `faiss` library.

## Steps

### 1. Generate Embeddings

The first step is to generate embeddings from the text corpus. The script 
`build_datastore.py` is used for this purpose. It takes a list of sentences 
and generates embeddings for each token in the sentences using a pretrained 
model.

To run the script, use the following command:

```bash
python scripts/baseline_kmeans/build_datastore.py \
  --model_name <model_name> \
  --memmap_file <memmap_file> \
  --dstore_size <dstore_size> \
  --max_length <max_length> \
  --batch_size <batch_size> \
  --device <device> \
  --use_fp16
```

Replace the placeholders with appropriate values:

- `<model_name>`: The name of the pre-trained model to use for generating embeddings.
- `<memmap_file>`: The file path where the generated embeddings will be stored.
- `<dstore_size>`: The total number of embeddings to store.
- `<max_length>`: The maximum length of input sequences.
- `<batch_size>`: The number of sentences to process at a time.
- `<device>`: The device to run the model on ('cuda' or 'cpu').
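
As a rough illustration of what a memmap-backed datastore looks like, the sketch below preallocates a `(dstore_size, dimension)` array on disk and fills it batch by batch. The function names, the `float16` dtype, and the random vectors standing in for model outputs are assumptions for illustration; `build_datastore.py` may lay out the file differently.

```python
import os
import tempfile

import numpy as np


def write_datastore(path, batches, dstore_size, dimension, dtype=np.float16):
    """Write batches of token embeddings into a preallocated memmap file."""
    store = np.memmap(path, dtype=dtype, mode="w+", shape=(dstore_size, dimension))
    offset = 0
    for batch in batches:
        # Stop once the datastore is full; truncate the last batch if needed.
        n = min(len(batch), dstore_size - offset)
        if n <= 0:
            break
        store[offset:offset + n] = batch[:n].astype(dtype)
        offset += n
    store.flush()
    return offset  # number of rows actually written


def read_datastore(path, dstore_size, dimension, dtype=np.float16):
    """Open the datastore read-only; shape and dtype must match the writer."""
    return np.memmap(path, dtype=dtype, mode="r", shape=(dstore_size, dimension))


# Random vectors stand in for token embeddings (hypothetical data).
rng = np.random.default_rng(0)
fake_batches = [rng.standard_normal((4, 8)).astype(np.float32) for _ in range(3)]
path = os.path.join(tempfile.mkdtemp(), "dstore.mmap")
written = write_datastore(path, fake_batches, dstore_size=10, dimension=8)
store = read_datastore(path, dstore_size=10, dimension=8)
```

A raw memmap file carries no header, so the reader has to be given the same `dstore_size`, `dimension`, and dtype as the writer. This is presumably why the training step takes `--dstore_size` and `--dimension` again.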

### 2. Train Clustering Model

The next step is to train a clustering model on the generated embeddings. The script `train_kmeans.py` is used for this purpose. It takes the embeddings and trains a KMeans clustering model on them.

To run the script, use the following command:

```bash
python scripts/baseline_kmeans/train_kmeans.py \
  --memmap_file <memmap_file> \
  --dstore_size <dstore_size> \
  --dimension <dimension> \
  --model <model> \
  --num_clusters <num_clusters> \
  --max_iter <max_iter> \
  --n_init <n_init> \
  --tolerance <tolerance> \
  --random_state <random_state> \
  --model_path <model_path> \
  --use_gpu
```

Replace the placeholders with appropriate values:

- `<memmap_file>`: The file path where the generated embeddings are stored.
- `<dstore_size>`: The total number of embeddings stored.
- `<dimension>`: The dimension of the embeddings.
- `<model>`: The KMeans implementation to use.
- `<num_clusters>`: The number of clusters for KMeans.
- `<max_iter>`: The maximum number of iterations for KMeans.
- `<n_init>`: The number of initializations for KMeans. When >1, the best run is
  selected.
- `<tolerance>`: The relative tolerance to declare convergence.
- `<random_state>`: The random seed for reproducibility.
- `<model_path>`: The file path to save the trained KMeans model.
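
To make the roles of `num_clusters`, `max_iter`, `n_init`, `tolerance`, and `random_state` concrete, here is a minimal NumPy sketch of Lloyd's k-means with restarts. It is not the implementation used by `train_kmeans.py`. The function names are hypothetical, and the stopping rule uses an absolute centroid shift as a simplification of a relative tolerance.

```python
import numpy as np


def kmeans_once(x, num_clusters, max_iter, tolerance, rng):
    """One k-means run from a random initialisation; returns (centroids, inertia)."""
    # Initialise centroids from distinct random data points.
    centroids = x[rng.choice(len(x), size=num_clusters, replace=False)].copy()
    for _ in range(max_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for k in range(num_clusters):
            members = x[labels == k]
            if len(members) > 0:  # keep the old centroid for an empty cluster
                new_centroids[k] = members.mean(axis=0)
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift <= tolerance:  # simplified (absolute) convergence check
            break
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    inertia = dists.min(axis=1).sum()
    return centroids, inertia


def train_kmeans(x, num_clusters, max_iter=100, n_init=3, tolerance=1e-4, random_state=0):
    """Run k-means n_init times and keep the run with the lowest inertia."""
    rng = np.random.default_rng(random_state)
    best = None
    for _ in range(n_init):
        centroids, inertia = kmeans_once(x, num_clusters, max_iter, tolerance, rng)
        if best is None or inertia < best[1]:
            best = (centroids, inertia)
    return best


# Two well-separated synthetic blobs stand in for token embeddings.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
centroids, inertia = train_kmeans(data, num_clusters=2)
```

The dense distance matrix in this sketch costs memory proportional to the number of points times the number of clusters, so it only suits small inputs. A datastore of realistic size needs a batched or GPU-backed implementation.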

### 3. Map Text Corpus to Cluster Centroids

The third step feeds a new text corpus into the `prepare_text_for_index.py` script. The script uses the same pre-trained embedding model to map the input tokens to embeddings, then maps each embedding to its cluster centroid using the trained clustering model. The output represents each token in the text corpus as a cluster ID.

To run the script, use the following command:

```bash
python scripts/baseline_kmeans/prepare_text_for_index.py \
  --model_name <model_name> \
  --clustering_model <clustering_model> \
  --clustering_model_path <clustering_model_path> \
  --index_path <index_path> \
  --max_length <max_length> \
  --batch_size <batch_size> \
  --device <device>
```

Replace the placeholders with appropriate values:

- `<model_name>`: The name of the pre-trained model to use for generating embeddings.
- `<clustering_model>`: The name of the clustering model to use.
- `<clustering_model_path>`: The file path to the trained clustering model.
- `<index_path>`: The path to save the BM25 index files.
- `<max_length>`: The maximum length of input sequences.
- `<batch_size>`: The number of sentences to process at a time.
- `<device>`: The device to run the model on ('cuda' or 'cpu').
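
The core of this step is a nearest-centroid lookup. The sketch below assigns each token embedding to its closest centroid and renders the resulting IDs as a space-separated string that a text indexer can treat as words. The function names and the string format are assumptions for illustration; the JSON layout written by `prepare_text_for_index.py` is not shown here.

```python
import numpy as np


def assign_clusters(embeddings, centroids):
    """Return the index of the nearest centroid for every embedding."""
    # Uses ||e - c||^2 = ||e||^2 - 2 e.c + ||c||^2 to avoid a 3-D intermediate.
    sq_dists = (
        (embeddings ** 2).sum(axis=1, keepdims=True)
        - 2.0 * embeddings @ centroids.T
        + (centroids ** 2).sum(axis=1)
    )
    return sq_dists.argmin(axis=1)


def to_pseudo_text(cluster_ids):
    """Render cluster IDs as whitespace-separated tokens (hypothetical format)."""
    return " ".join(str(int(i)) for i in cluster_ids)


# Toy centroids and token embeddings (hypothetical values).
centroids = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
token_embeddings = np.array([[0.5, -0.2], [9.0, 11.0], [1.0, 9.5], [0.1, 0.1]])
ids = assign_clusters(token_embeddings, centroids)
text = to_pseudo_text(ids)
print(text)  # 0 1 2 0
```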


### 4. Build the BM25 Index

After mapping the text corpus to cluster centroids, the next step is to
build the BM25 index. This is done using the `build_bm25.sh` script. The
script accepts one or more preprocessed collections. Each argument is the
parent folder containing the JSON files from the previous step, and the
script creates a separate BM25 index for each one.

To run the script, use the following command:

```bash
./scripts/baseline_kmeans/build_bm25.sh <path1> <path2> ...
```
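
For intuition about what the index computes, the following is a small pure-Python sketch of BM25 scoring over documents whose "words" are cluster IDs. It is not what `build_bm25.sh` runs. The function names, the toy documents, and the parameter values `k1=1.2` and `b=0.75` are illustrative assumptions.

```python
import math
from collections import Counter


def build_index(docs):
    """docs maps doc_id -> list of tokens (here, cluster IDs as strings)."""
    doc_tf = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    doc_len = {doc_id: len(tokens) for doc_id, tokens in docs.items()}
    df = Counter()
    for tf in doc_tf.values():
        df.update(tf.keys())  # document frequency: count each term once per doc
    avgdl = sum(doc_len.values()) / len(docs)
    return doc_tf, doc_len, df, avgdl


def bm25_score(query, doc_id, index, k1=1.2, b=0.75):
    """BM25 score of one document for a tokenized query."""
    doc_tf, doc_len, df, avgdl = index
    n_docs = len(doc_tf)
    score = 0.0
    for term in query:
        tf = doc_tf[doc_id].get(term, 0)
        if tf == 0:
            continue
        idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
        norm = tf + k1 * (1.0 - b + b * doc_len[doc_id] / avgdl)
        score += idf * tf * (k1 + 1.0) / norm
    return score


def search(query, index, top_k=100):
    """Return up to top_k (doc_id, score) pairs with a positive score, best first."""
    scored = [(doc_id, bm25_score(query, doc_id, index)) for doc_id in index[0]]
    scored = [pair for pair in scored if pair[1] > 0.0]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy discretized documents and query (hypothetical cluster IDs).
docs = {"d1": "3 7 7 12".split(), "d2": "5 5 9".split(), "d3": "7 1 2 4 8".split()}
index = build_index(docs)
ranked = search("7 12".split(), index, top_k=2)
```

Because the documents are sequences of cluster IDs, two texts match only when their tokens fall into the same clusters, which is what makes the choice of `num_clusters` in step 2 matter for retrieval.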


### 5. Retrieve and Score Documents

The final step involves retrieving documents using the BM25 index and scoring the results. This is done using the `retrieve_with_bm25.py` script. This script takes the encoded queries and retrieves the top 100 documents for each query from the BM25 index. It then scores the results using the gold standard qrels.

To run the script, use the following command:

```bash
python scripts/baseline_kmeans/retrieve_with_bm25.py \
  --model <model> \
  --clustering_model <clustering_model> \
  --clustering_model_path <clustering_model_path> \
  --dataset <dataset> \
  --index_path <index_path> \
  --output_path <output_path> \
  --limit <limit> \
  --batch_size <batch_size> \
  --query_formatting <query_formatting> \
  --device <device> \
  --max_length <max_length> \
  --encoding <encoding>
```

Replace the placeholders with appropriate values:

- `<model>`: The name of the pre-trained model to use for generating embeddings.
- `<clustering_model>`: The name of the clustering model to use.
- `<clustering_model_path>`: The file path to the trained clustering model.
- `<dataset>`: The dataset split to evaluate on, e.g., 'msmarco-passage/dev'.
- `<index_path>`: The path to the BM25 index created from discretized documents.
- `<output_path>`: The directory to save the encoded queries, qrels, and scores.
- `<limit>`: Limit the number of queries to process.
- `<batch_size>`: The number of sentences per batch.
- `<query_formatting>`: Formatting applied to each query before tokenization.
- `<device>`: The device to run the model on ('cuda' or 'cpu').
- `<max_length>`: The maximum length of input sequences.
- `<encoding>`: Encoding method for the queries.

Please note that the script `retrieve_with_bm25.py` should be run after the BM25 index has been built.
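
Scoring against qrels amounts to comparing each ranked list with the set of relevant documents for that query. The sketch below computes two common retrieval metrics, MRR@10 and Recall@100. Which metrics `retrieve_with_bm25.py` actually reports is not stated here, so the metric choice, the function names, and the data layout are all assumptions.

```python
def reciprocal_rank(ranked_doc_ids, relevant, cutoff=10):
    """1/rank of the first relevant document within the cutoff, else 0."""
    for rank, doc_id in enumerate(ranked_doc_ids[:cutoff], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_doc_ids, relevant, k=100):
    """Fraction of relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked_doc_ids[:k]) & relevant) / len(relevant)


def evaluate(run, qrels):
    """run: query_id -> ranked doc IDs; qrels: query_id -> set of relevant doc IDs."""
    query_ids = [qid for qid in qrels if qid in run]
    mrr = sum(reciprocal_rank(run[q], qrels[q]) for q in query_ids) / len(query_ids)
    recall = sum(recall_at_k(run[q], qrels[q]) for q in query_ids) / len(query_ids)
    return {"mrr@10": mrr, "recall@100": recall}


# Toy run and qrels (hypothetical IDs).
run = {"q1": ["d3", "d1", "d9"], "q2": ["d4", "d5"]}
qrels = {"q1": {"d1"}, "q2": {"d7"}}
metrics = evaluate(run, qrels)
print(metrics)  # {'mrr@10': 0.25, 'recall@100': 0.5}
```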
