Summary: This tutorial demonstrates how to prepare data for fine-tuning embedding models, specifically focusing on information retrieval tasks. It covers essential implementation techniques for loading datasets, formatting training data with positive and negative examples, and preparing evaluation data. Key functionalities include dataset transformation using the 'datasets' library, generating negative samples, adding prompts for retrieval, and creating evaluation components (queries, corpus, and relevance relationships). The tutorial helps with tasks like structuring training data in the required format {query, pos, neg, prompt}, splitting datasets, and preparing evaluation metrics. It's particularly useful for implementing embedding model fine-tuning pipelines and information retrieval systems.

# Data Preparation for Fine-tuning

In this tutorial, we will show an example of the first step for fine-tuning: dataset preparation.

## 0. Installation


```python
% pip install -U datasets
```

Suppose we are willing to fine-tune our model for financial tasks. We found an open-source dataset that could be useful: [financial-qa-10k](https://huggingface.co/datasets/virattt/financial-qa-10K). Let's see how to properly prepare our dataset for fine-tuning.

The raw dataset has the following structure:
- 5 columns of: 'question', 'answer', 'context', 'ticker', and 'filing'.
- 7000 rows.


```python
from datasets import load_dataset

ds = load_dataset("virattt/financial-qa-10K", split="train")
ds
```

## 1. Data for Fine-tuning

Construct the dataset to the following format:

``` python
{"query": str, "pos": List[str], "neg":List[str], "pos_scores": List[int], "neg_scores": List[int], "prompt": str, "type": str}
```

`query` is the query, and `pos` is a list of positive texts, `neg` is a list of negative texts. `pos_scores` is a list of scores corresponding to the query and pos, `neg_scores` is a list of scores corresponding to the `query` and `neg`, if you don't use knowledge distillation, it can be ignored. `prompt` is the prompt used for the query, it will cover query_instruction_for_retrieval. `type` is used for bge-en-icl, it includes `normal`, `symmetric_class`, `symmetric_clustering`, .etc. If you have no negative texts for a query, you can random sample some from the entire corpus as the negatives.

We select the columns 'question' and 'context' as our query and answer(pos), and rename the columns. Then add the 'id' column for later evaluation use.


```python
ds = ds.select_columns(column_names=["question", "context"])
ds = ds.rename_column("question", "query")
ds = ds.rename_column("context", "pos")
ds = ds.add_column("id", [str(i) for i in range(len(ds))])
ds[0]
```

Negative examples are important during the training of embedding models. Our initial dataset does not come with negative texts. Thus we directly sample a few from the whole corpus.


```python
import numpy as np

np.random.seed(520)
neg_num = 10

def str_to_lst(data):
    data["pos"] = [data["pos"]]
    return data

# sample negative texts
new_col = []
for i in range(len(ds)):
    ids = np.random.randint(0, len(ds), size=neg_num)
    while i in ids:
        ids = np.random.randint(0, len(ds), size=neg_num)
    neg = [ds[i.item()]["pos"] for i in ids]
    new_col.append(neg)
ds = ds.add_column("neg", new_col)

# change the key of 'pos' to a list
ds = ds.map(str_to_lst)
```

Lastly, we add the prompt which is used for query. It will be the `query_instruction_for_retrieval` during inference.


```python
instruction = "Represent this sentence for searching relevant passages: "
ds = ds.add_column("prompt", [instruction]*len(ds))
```

Now a single row of the dataset is:


```python
ds[0]
```

Then we split the dataset into training set and testing set.


```python
split = ds.train_test_split(test_size=0.1, shuffle=True, seed=520)
train = split["train"]
test = split["test"]
```

Now we are ready to store the data for later fine-tuning:


```python
train.to_json("ft_data/training.json")
```

## 2. Test Data for Evaluation

The last step is to construct the testing dataset for evaluaton.


```python
test
```

First select the columns for queries:


```python
queries = test.select_columns(column_names=["id", "query"])
queries = queries.rename_column("query", "text")
queries[0]
```

Then select the columns for corpus:


```python
corpus = ds.select_columns(column_names=["id", "pos"])
corpus = corpus.rename_column("pos", "text")
```

Finally, make the qrels that indicating the relations of queries and corresponding corpus"


```python
qrels = test.select_columns(["id"])
qrels = qrels.rename_column("id", "qid")
qrels = qrels.add_column("docid", list(test["id"]))
qrels = qrels.add_column("relevance", [1]*len(test))
qrels[0]
```

Store the training set


```python
queries.to_json("ft_data/test_queries.jsonl")
corpus.to_json("ft_data/corpus.jsonl")
qrels.to_json("ft_data/test_qrels.jsonl")
```
