# Condensed: Data Preparation for Fine-tuning

Summary: This tutorial demonstrates how to prepare data for fine-tuning embedding models, specifically focusing on information retrieval tasks. It covers essential implementation techniques for loading datasets, formatting training data with positive and negative examples, and preparing evaluation data. Key functionalities include dataset transformation using the 'datasets' library, generating negative samples, adding prompts for retrieval, and creating evaluation components (queries, corpus, and relevance relationships). The tutorial helps with tasks like structuring training data in the required format {query, pos, neg, prompt}, splitting datasets, and preparing evaluation metrics. It's particularly useful for implementing embedding model fine-tuning pipelines and information retrieval systems.

*This is a condensed version that preserves essential implementation details and context.*

Here's the condensed tutorial focusing on essential implementation details:

# Data Preparation for Fine-tuning

## Key Requirements
```python
pip install -U datasets
```

## Implementation Steps

### 1. Load and Format Dataset
```python
from datasets import load_dataset

# Load dataset
ds = load_dataset("virattt/financial-qa-10K", split="train")

# Required format for fine-tuning:
{
    "query": str,
    "pos": List[str],
    "neg": List[str],
    "pos_scores": List[int],  # Optional for knowledge distillation
    "neg_scores": List[int],  # Optional for knowledge distillation
    "prompt": str,
    "type": str
}
```

### 2. Prepare Training Data
```python
# Rename and select columns
ds = ds.select_columns(["question", "context"])
ds = ds.rename_column("question", "query")
ds = ds.rename_column("context", "pos")
ds = ds.add_column("id", [str(i) for i in range(len(ds))])

# Generate negative examples
np.random.seed(520)
neg_num = 10

def str_to_lst(data):
    data["pos"] = [data["pos"]]
    return data

# Sample negative texts
new_col = []
for i in range(len(ds)):
    ids = np.random.randint(0, len(ds), size=neg_num)
    while i in ids:
        ids = np.random.randint(0, len(ds), size=neg_num)
    neg = [ds[i.item()]["pos"] for i in ids]
    new_col.append(neg)
ds = ds.add_column("neg", new_col)
ds = ds.map(str_to_lst)

# Add prompt
instruction = "Represent this sentence for searching relevant passages: "
ds = ds.add_column("prompt", [instruction]*len(ds))
```

### 3. Split and Save Training Data
```python
# Split dataset
split = ds.train_test_split(test_size=0.1, shuffle=True, seed=520)
train, test = split["train"], split["test"]

# Save training data
train.to_json("ft_data/training.json")
```

### 4. Prepare Evaluation Data
```python
# Prepare queries
queries = test.select_columns(["id", "query"])
queries = queries.rename_column("query", "text")

# Prepare corpus
corpus = ds.select_columns(["id", "pos"])
corpus = corpus.rename_column("pos", "text")

# Prepare qrels (query-document relevance)
qrels = test.select_columns(["id"])
qrels = qrels.rename_column("id", "qid")
qrels = qrels.add_column("docid", list(test["id"]))
qrels = qrels.add_column("relevance", [1]*len(test))

# Save evaluation data
queries.to_json("ft_data/test_queries.jsonl")
corpus.to_json("ft_data/corpus.jsonl")
qrels.to_json("ft_data/test_qrels.jsonl")
```

## Important Notes
- Negative examples are crucial for training embedding models
- If no negative texts are available, randomly sample from corpus
- The prompt will be used as `query_instruction_for_retrieval` during inference
- Evaluation data requires three components: queries, corpus, and relevance relationships (qrels)